ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers

Authors: Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, Yuxiong He

NeurIPS 2022

Reproducibility assessment: each entry below gives the variable, the result, and the supporting LLM response quoted from the paper.
Research Type: Experimental. LLM Response: Our empirical results show that ZeroQuant enables quantizing BERT and GPT-3-style models into INT8 weights and activations while retaining accuracy, without incurring any retraining cost. Compared to FP16 inference, our INT8 model achieves up to 5.19x/4.16x speedup on BERT-base/GPT-3-350M on A100 GPUs.
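
For concreteness, the weight side of that INT8 claim can be sketched in a few lines of PyTorch. The snippet below shows plain per-row symmetric INT8 quantization and dequantization; it only illustrates the arithmetic, not ZeroQuant's group-wise scheme or its fused CUDA kernels in DeepSpeed, and the tensor shape is made up.

```python
# Minimal sketch of per-row symmetric INT8 weight quantization; an
# illustration of the arithmetic only, not ZeroQuant's kernel implementation.
import torch

def quantize_int8(w: torch.Tensor):
    """Return INT8 weights plus one FP scale per output row."""
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float16) * scale.to(torch.float16)

w = torch.randn(1024, 1024, dtype=torch.float16)   # made-up weight matrix
q, s = quantize_int8(w.float())
print((dequantize(q, s) - w).abs().max())           # worst-case quantization error
```
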
Researcher Affiliation: Industry. LLM Response: Microsoft; {zheweiyao, yazdani.reza, minjiaz, xiaoxiawu, conglong.li, yuxhe}@microsoft.com
Pseudocode: No. LLM Response: The paper includes architectural diagrams (Figure 1, Figure 3) but does not contain any structured pseudocode or algorithm blocks.
Open Source Code: Yes. LLM Response: Code is released as part of https://github.com/microsoft/DeepSpeed
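
The released code lives inside the DeepSpeed library; its INT8 inference path has typically been exercised along the lines of the sketch below. Treat this as a hedged sketch only: the keyword arguments to deepspeed.init_inference have changed across DeepSpeed releases, and the model name is just an example, so check the repository for the current API.

```python
# Hedged sketch of DeepSpeed's INT8 inference entry point; kwarg names have
# varied across releases, so treat them as assumptions and verify in the repo.
import torch
import deepspeed
from transformers import AutoModelForCausalLM  # assumed available

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")
engine = deepspeed.init_inference(
    model,
    dtype=torch.int8,                 # request the INT8 path
    replace_with_kernel_inject=True,  # swap in DeepSpeed's fused inference kernels
)
# engine.module is the kernel-injected model; use it as a drop-in replacement.
```
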
Open Datasets: Yes. LLM Response: For BERT, we tested both BERT-base and BERT-large on the GLUE benchmark; for GPT-3-style models, we tested GPT-3-350M (i.e., a GPT-3-style model with 350M parameters) and GPT-3-1.3B (i.e., a GPT-3-style model with 1.3B parameters) on 20 zero-shot evaluation tasks, including 19 accuracy-based tasks and 1 language modeling generation task. To illustrate the scalability of the proposed ZeroQuant, we also directly apply it to two of the largest open-sourced GPT-3-style models, i.e., GPT-J-6B [67] and GPT-NeoX-20B [5]. The results are shown in Table 10. Compared to ZeroQuant, LKD using random data can boost the accuracy by 1.1% and reduce the PPL from 92.1 to 40.6. The reason random data can still significantly improve performance is that LKD does not optimize the end-to-end pipeline; it only learns the internal dependency layer by layer from the teacher model, so even random data provides meaningful information. Using Wikipedia data from Hugging Face can further improve the accuracy to 36.2 and reduce the PPL to 30.4, which is comparable to the results using the original data. This indicates that a clean text dataset can be used for LKD when we do not have access to the original full dataset.
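
The observation that random data still helps LKD follows from the shape of the objective: each quantized layer only has to reproduce its teacher layer's output on whatever hidden states it is fed, with no end-to-end loss or labels. The sketch below is one way to write that loop; teacher_layers, student_layers, and make_batch are hypothetical placeholders, each layer is treated as a plain tensor-to-tensor module for brevity, and the student layer is assumed to be fake-quantized with trainable FP parameters rather than ZeroQuant's actual kernels.

```python
# Hedged sketch of layer-by-layer knowledge distillation (LKD): layer k of the
# (fake-)quantized student is trained to match layer k of the FP teacher on the
# teacher's own hidden states, so even random tokens give a usable signal.
# teacher_layers / student_layers / make_batch are placeholders, not the paper's code.
import torch
import torch.nn.functional as F

def distill_layer(teacher_layers, student_layers, k, make_batch, steps=10, lr=1e-4):
    opt = torch.optim.Adam(student_layers[k].parameters(), lr=lr)
    for _ in range(steps):
        x = make_batch()                      # embedded tokens; may be random text
        with torch.no_grad():
            for j in range(k):                # run the teacher up to layer k-1
                x = teacher_layers[j](x)
            target = teacher_layers[k](x)     # what layer k "should" produce
        pred = student_layers[k](x)           # quantized candidate for layer k
        loss = F.mse_loss(pred, target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student_layers[k]
```
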
Dataset Splits: Yes. LLM Response: BERT models are evaluated on the development set of the GLUE benchmark (except WNLI).
Hardware Specification: Yes. LLM Response: Compared to FP16 inference, our INT8 model achieves up to 5.19x/4.16x speedup on BERT-base/GPT-3-350M on A100 GPUs. We compare the inference speed of BERT between FP16 and our INT8 versions in Table 6 on a single 40GB A100 GPU.
Software Dependencies: No. LLM Response: The paper mentions software components like 'CUTLASS INT8 GeMM' and the 'PyTorch quantization suite', but does not provide specific version numbers for these or other software dependencies.
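
For reference, the 'PyTorch quantization suite' alluded to here is typically entered through a call like the one below. It is shown only to illustrate the kind of baseline utility involved; it is not ZeroQuant's CUTLASS INT8 GeMM path, and the model name is an example.

```python
# Stock PyTorch dynamic quantization, the kind of baseline utility the excerpt
# mentions; this is NOT ZeroQuant's CUTLASS-kernel inference path.
import torch
from transformers import AutoModelForSequenceClassification  # assumed available

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model_int8 = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},   # quantize only Linear layers
    dtype=torch.qint8,   # INT8 weights, dynamically quantized activations
)
```
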
Experiment Setup: Yes. LLM Response: We use a fixed set of hyperparameters for all the LKD-related experiments, even though tuning them may benefit our results. Please see Appendix A.2 for more training details and Appendix A.3 for the reported metrics for BERT. For LKD, we used a learning rate of 1e-4 with 10 iterations on a batch size of 256 for BERT models and 10 iterations on a batch size of 1 for GPT-3 models.
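
Those reported hyperparameters are compact enough to restate as a config sketch; the values come directly from the excerpt above, while the dictionary layout itself is illustrative.

```python
# LKD hyperparameters as reported above; the dict structure is illustrative,
# not taken from the paper's code.
LKD_CONFIG = {
    "learning_rate": 1e-4,                    # fixed across all LKD experiments
    "iterations": 10,                         # per quantized layer
    "batch_size": {"bert": 256, "gpt3": 1},   # BERT vs. GPT-3-style models
}
```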