Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models
Authors: Xiuying Wei, Yunchen Zhang, Xiangguo Zhang, Ruihao Gong, Shanghang Zhang, Qi Zhang, Fengwei Yu, Xianglong Liu
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments prove that our framework surpasses the existing works and, for the first time, pushes the 6-bit post-training BERT quantization to the full-precision (FP) level. |
| Researcher Affiliation | Collaboration | Xiuying Wei (1,2), Yunchen Zhang (2,4), Xiangguo Zhang (2), Ruihao Gong (1,2), Shanghang Zhang (3), Qi Zhang (2), Fengwei Yu (2), Xianglong Liu (1). 1: State Key Lab of Software Development Environment, Beihang University; 2: SenseTime Research; 3: Peking University; 4: University of Electronic Science and Technology of China |
| Pseudocode | No | The paper contains a 'Flow diagram' in Figure 4 but no explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/wimh966/outlier_suppression. |
| Open Datasets | Yes | On the whole, we evaluate GLUE benchmark [33], SQuAD [34, 35], and XSum [36] and CNN/Daily Mail [37] across BERT, RoBERTa, and BART models. |
| Dataset Splits | Yes | For PTQ, equipping our framework, we use 256 samples to calibrate the model. For training, hyper-parameters like learning rate are searched both for our methods and baseline techniques for fair comparisons. Details see Appendix F. (Tables 4 and 5 also report MNLI acc m/mm, i.e., accuracy on the matched/mismatched validation sets, as is standard for GLUE.) |
| Hardware Specification | No | The main paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running the experiments. While the author checklist indicates this information is provided, it is not present in the main body of the paper. |
| Software Dependencies | No | The paper mentions combining methods with 'LSQ+ [12]' and takes schemes from 'Faster Transformer [38]', but it does not specify software components with version numbers (e.g., PyTorch version, specific library versions). |
| Experiment Setup | Yes | For PTQ, equipping our framework, we use 256 samples to calibrate the model. For training, hyper-parameters like learning rate are searched both for our methods and baseline techniques for fair comparisons. Details see Appendix F. (Also, 'Here, 4-4-4 presents 4-bit weight, embedding, and activation.' and 'For the percentile, we search the hyper-parameter in [0.999, 0.9999, 0.99999] and report the best on dev set.') |
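
The setup rows above quote two concrete calibration details from the paper: post-training quantization is calibrated on 256 samples, and the percentile clipping threshold is searched over [0.999, 0.9999, 0.99999] with the best value chosen on the dev set. The sketch below is a minimal NumPy illustration of how such a percentile-calibrated uniform fake-quantizer could be wired up; the function and variable names (`percentile_clip_range`, `fake_quantize`, `calibrate`, `dev_eval`) are assumptions for illustration only and do not reproduce the authors' released code (see the linked GitHub repository for the actual implementation).

```python
import numpy as np

# Hypothetical illustration of percentile-based activation calibration followed by
# symmetric uniform fake-quantization. Grid and calibration size are quoted from
# the paper; everything else (names, shapes, the dev metric) is an assumption.

PERCENTILES = [0.999, 0.9999, 0.99999]   # search grid reported in the paper
NUM_CALIB = 256                          # calibration set size reported in the paper


def percentile_clip_range(acts: np.ndarray, p: float) -> float:
    """Symmetric clipping threshold covering the p-quantile of |activation|."""
    return float(np.quantile(np.abs(acts), p))


def fake_quantize(x: np.ndarray, clip: float, num_bits: int = 6) -> np.ndarray:
    """Uniform fake-quantization: quantize to num_bits signed levels, then dequantize."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = clip / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale


def calibrate(calib_acts: np.ndarray, dev_eval, num_bits: int = 6) -> float:
    """Grid-search the percentile and keep the clip that scores best on the dev set."""
    best_clip, best_score = None, -np.inf
    for p in PERCENTILES:
        clip = percentile_clip_range(calib_acts, p)
        score = dev_eval(lambda x: fake_quantize(x, clip, num_bits))
        if score > best_score:
            best_clip, best_score = clip, score
    return best_clip


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in for activations gathered from 256 calibration samples,
    # with one artificially outlier-heavy channel.
    acts = rng.standard_normal((NUM_CALIB, 128)) * np.array([1.0] * 127 + [30.0])
    # Stand-in dev-set metric: negative quantization error (higher is better).
    dev_eval = lambda quant: -float(np.mean((acts - quant(acts)) ** 2))
    clip = calibrate(acts, dev_eval)
    print(f"chosen clipping threshold = {clip:.3f}")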
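```

The sketch assumes per-tensor symmetric quantization; `num_bits` corresponds to the bit-widths in the paper's notation (e.g., 6-bit, or the "4-4-4" shorthand for 4-bit weights, embeddings, and activations). In practice the dev-set metric would be the task score (e.g., GLUE accuracy) rather than the reconstruction error used here for brevity.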