Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models

Authors: Xiuying Wei, Yunchen Zhang, Xiangguo Zhang, Ruihao Gong, Shanghang Zhang, Qi Zhang, Fengwei Yu, Xianglong Liu

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments prove that our framework surpasses the existing works and, for the first time, pushes the 6-bit post-training BERT quantization to the full-precision (FP) level.
Researcher Affiliation | Collaboration | Xiuying Wei (1,2), Yunchen Zhang (2,4), Xiangguo Zhang (2), Ruihao Gong (1,2), Shanghang Zhang (3), Qi Zhang (2), Fengwei Yu (2), Xianglong Liu (1). (1) State Key Lab of Software Development Environment, Beihang University; (2) SenseTime Research; (3) Peking University; (4) University of Electronic Science and Technology of China
Pseudocode | No | The paper contains a 'Flow diagram' in Figure 4 but no explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at https://github.com/wimh966/outlier_suppression.
Open Datasets | Yes | On the whole, we evaluate GLUE benchmark [33], SQuAD [34, 35], and XSum [36] and CNN/Daily Mail [37] across BERT, RoBERTa, and BART models.
Dataset Splits | Yes | For PTQ, equipping our framework, we use 256 samples to calibrate the model. For training, hyper-parameters like learning rate are searched both for our methods and baseline techniques for fair comparisons. Details see Appendix F. (Tables 4 and 5 also show MNLI acc m/mm, indicating matched/mismatched validation set accuracies common in GLUE.)
Hardware Specification | No | The main paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running the experiments. While the author checklist indicates this information is provided, it is not present in the main body of the paper.
Software Dependencies | No | The paper mentions combining methods with 'LSQ+ [12]' and takes schemes from 'FasterTransformer [38]', but it does not specify software components with version numbers (e.g., PyTorch version, specific library versions).
Experiment Setup | Yes | For PTQ, equipping our framework, we use 256 samples to calibrate the model. For training, hyper-parameters like learning rate are searched both for our methods and baseline techniques for fair comparisons. Details see Appendix F. (Also, 'Here, 4-4-4 presents 4-bit weight, embedding, and activation.' and 'For the percentile, we search the hyper-parameter in [0.999, 0.9999, 0.99999] and report the best on dev set.') An illustrative sketch of this calibration recipe follows the table.
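To make the quoted setup concrete, below is a minimal, hypothetical sketch of percentile-based post-training calibration: activations gathered from 256 calibration samples, a symmetric fake quantizer at a given bit-width, and a clipping percentile searched over [0.999, 0.9999, 0.99999]. This is not the authors' Outlier Suppression implementation; the names fake_quantize, percentile_clip, and search_percentile are invented for illustration, and quantization MSE on the calibration activations stands in for the paper's selection of the best candidate on the dev set.

```python
# Minimal sketch (not the authors' code): percentile-clipped PTQ calibration
# matching the quoted recipe (256 calibration samples, percentile search over
# [0.999, 0.9999, 0.99999]). All function and variable names are hypothetical.
import torch


def fake_quantize(x: torch.Tensor, n_bits: int, clip_max: float) -> torch.Tensor:
    """Symmetric uniform fake quantization with a fixed clipping range."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = max(clip_max, 1e-8) / qmax
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale


def percentile_clip(acts: torch.Tensor, percentile: float) -> float:
    """Clipping threshold taken as a high percentile of |activations|."""
    vals = acts.abs().flatten().float()
    k = max(1, int(round(percentile * vals.numel())))
    return torch.kthvalue(vals, k).values.item()


@torch.no_grad()
def search_percentile(calib_acts: torch.Tensor, n_bits: int,
                      candidates=(0.999, 0.9999, 0.99999)) -> float:
    """Pick the candidate percentile whose clipping range gives the lowest
    quantization MSE on the calibration activations (the paper instead
    reports the best candidate on the dev set)."""
    best_p, best_err = candidates[0], float("inf")
    for p in candidates:
        clip = percentile_clip(calib_acts, p)
        err = (fake_quantize(calib_acts, n_bits, clip) - calib_acts).pow(2).mean().item()
        if err < best_err:
            best_p, best_err = p, err
    return best_p


# Usage: stand-in activations from 256 calibration samples, calibrated for the
# activation part of a 6-bit setting.
calib_acts = torch.randn(256, 128, 64) * 2.0   # hypothetical hidden states
calib_acts[0, 0, :8] += 40.0                   # a few synthetic outliers
print("chosen percentile:", search_percentile(calib_acts, n_bits=6))
```

In the weight-embedding-activation notation quoted above (e.g., 4-4-4 or 6-6-6), the same kind of quantizer would be applied with the listed bit-widths to the weights, the embedding layer, and the activations, respectively.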