Boosting Deep Neural Network Efficiency with Dual-Module Inference
Authors: Liu Liu, Lei Deng, Zhaodong Chen, Yuke Wang, Shuangchen Li, Jingwei Zhang, Yihua Yang, Zhenyu Gu, Yufei Ding, Yuan Xie
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We theoretically and experimentally provide evidence that the noise caused by pruning and quantization in these regions is less influential. This observation motivates us to apply more aggressive pruning and quantization to these insensitive regions. As listed in Table 1, we report the PPL on the testing set and the average cosine similarity between the activations of the baseline model and the noise-introduced model. Our method is evaluated on a CPU-based server platform (Intel(R) Xeon(R) CPU E5-2698 v4), as most inference workloads run on CPUs (Park et al., 2018). |
| Researcher Affiliation | Collaboration | 1 Department of Computer Science, University of California, Santa Barbara; 2 Department of Electrical and Computer Engineering, University of California, Santa Barbara; 3 Alibaba Group. Correspondence to: Liu Liu <liuliu@ucsb.edu>, Lei Deng <leideng@ucsb.edu>. |
| Pseudocode | Yes | Algorithm 1 Dual-Module Fine-tuning Algorithm and Algorithm 2 Dual-Module Inference Algorithm are provided in Section 3.4. |
| Open Source Code | No | The paper does not provide concrete access to source code, such as a specific repository link or an explicit code release statement. |
| Open Datasets | Yes | We train the little module while freezing the parameters of the big module, and we use the same training set and validation set to run the SGD optimization. Our implementations of LSTMs/GRUs are adapted from the word-level language modeling example from PyTorch using the same hyper-parameters to train baseline models. We report the word-level perplexity (PPL) as the measure of model quality. We further investigate Neural Machine Translation (NMT)...Our experiments show the de-tokenized BLEU score to measure the model quality on the public WMT16 English-German dataset. ResNet-18 is used for ImageNet classification. |
| Dataset Splits | Yes | We train the little module while freezing the parameters of the big module, and we use the same training set and validation set to run the SGD optimization. The optimal ϵ and β can be obtained experimentally on the validation set. We can simply assign a constant value to θth and tune it on the validation set. K can be a hyper-parameter tuned on the validation set. |
| Hardware Specification | Yes | Our method is evaluated on a CPU-based server platform (Intel(R) Xeon(R) CPU E5-2698 v4), as most inference workloads run on CPUs (Park et al., 2018). |
| Software Dependencies | Yes | The baseline implementation is the PyTorch CPU version with Intel MKL (version 2019.4) as the back-end BLAS kernel library. |
| Experiment Setup | Yes | We vary the insensitive ratio to show the quality-performance trade-off; a larger insensitive ratio means that more outputs come from the little module and that less memory overhead is incurred by running the big module. We conduct experiments on language modeling using a single-layer LSTM with 1500 hidden units. We quantize the little module to INT8 and reduce the hidden dimension from 1500 to three different levels... We fix the insensitive ratio at 50% across this set of experiments. We show the impact of different quantization levels on model quality and parameter size. |
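
The "Research Type" row quotes the paper's use of the average cosine similarity between the activations of the baseline model and the noise-introduced model (their Table 1). The snippet below is a minimal sketch of that metric only; the function and argument names are hypothetical and this is not the authors' evaluation code.

```python
import torch
import torch.nn.functional as F

def avg_cosine_similarity(baseline_acts, noisy_acts):
    """Average cosine similarity between corresponding activation batches.

    baseline_acts / noisy_acts: lists of tensors of shape (batch, hidden),
    e.g. hidden states collected from the baseline model and from a model
    whose insensitive regions were pruned/quantized (an assumed setup).
    """
    sims = [
        F.cosine_similarity(b.flatten(1), n.flatten(1), dim=1).mean()
        for b, n in zip(baseline_acts, noisy_acts)
    ]
    return torch.stack(sims).mean().item()
```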
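The "Pseudocode" row points to Algorithm 2 (Dual-Module Inference) in Section 3.4 of the paper. The sketch below illustrates the general idea under assumed interfaces: a cheap little module approximates the big module, and only the outputs judged sensitive are taken from the big module. The module names, the magnitude-based insensitivity criterion, and the dense fallback call to `big_module` are illustrative assumptions, not the authors' algorithm verbatim (in the actual method the big module would only be evaluated sparsely for the sensitive outputs).

```python
import torch

@torch.no_grad()
def dual_module_inference(x, big_module, little_module, insensitive_ratio=0.5):
    """Sketch: mix little-module and big-module outputs per output unit.

    Assumes both modules map x to activations of the same shape
    (batch, hidden); insensitive_ratio controls how many units keep the
    cheap little-module result.
    """
    approx = little_module(x)                          # cheap INT8 / narrow estimate
    k = max(1, int(approx.shape[-1] * insensitive_ratio))
    # One possible criterion: the k smallest-magnitude units are "insensitive".
    threshold = approx.abs().kthvalue(k, dim=-1, keepdim=True).values
    insensitive = approx.abs() <= threshold
    exact = big_module(x)                              # computed sparsely in practice
    return torch.where(insensitive, approx, exact)
```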
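The "Open Datasets" and "Dataset Splits" rows quote the paper's procedure of training the little module with SGD while the big module's parameters are frozen, on the same training and validation splits. A minimal PyTorch-style sketch of such a loop follows; the loss (MSE against the big module's activations), the loader format, and the hyper-parameters are assumptions for illustration.

```python
import torch

def finetune_little_module(big_module, little_module, train_loader,
                           epochs=1, lr=0.1):
    """Fit the little module to the frozen big module's activations with SGD."""
    for p in big_module.parameters():
        p.requires_grad = False                 # big module stays frozen
    optimizer = torch.optim.SGD(little_module.parameters(), lr=lr)
    mse = torch.nn.MSELoss()
    for _ in range(epochs):
        for x, _ in train_loader:
            with torch.no_grad():
                target = big_module(x)          # teacher activations
            loss = mse(little_module(x), target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```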
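The "Experiment Setup" row mentions quantizing the little module to INT8 and shrinking its hidden dimension from 1500. One way to obtain an INT8 little module in stock PyTorch is dynamic quantization, sketched below; the paper's exact quantization scheme may differ, and the module definition here is a placeholder.

```python
import torch
import torch.nn as nn

# Hypothetical reduced-width little module (e.g. hidden size 750 instead of 1500).
little_module = nn.LSTM(input_size=1500, hidden_size=750, num_layers=1)

# Dynamic quantization stores LSTM/Linear weights in INT8 and quantizes
# activations on the fly; this is one option, not necessarily the scheme
# used in the paper.
little_int8 = torch.quantization.quantize_dynamic(
    little_module, {nn.LSTM, nn.Linear}, dtype=torch.qint8)
```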