Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models
Authors: Xiuying Wei, Yunchen Zhang, Xiangguo Zhang, Ruihao Gong, Shanghang Zhang, Qi Zhang, Fengwei Yu, Xianglong Liu
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments prove that our framework surpasses the existing works and, for the first time, pushes the 6-bit post-training BERT quantization to the full-precision (FP) level. |
| Researcher Affiliation | Collaboration | Xiuying Wei (1, 2), Yunchen Zhang (2, 4), Xiangguo Zhang (2), Ruihao Gong (1, 2), Shanghang Zhang (3), Qi Zhang (2), Fengwei Yu (2), Xianglong Liu (1). Affiliations: (1) State Key Lab of Software Development Environment, Beihang University; (2) SenseTime Research; (3) Peking University; (4) University of Electronic Science and Technology of China. |
| Pseudocode | No | The paper contains a 'Flow diagram' in Figure 4 but no explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/wimh966/outlier_suppression. |
| Open Datasets | Yes | On the whole, we evaluate GLUE benchmark [33], SQuAD [34, 35], and XSum [36] and CNN/Daily Mail [37] across BERT, RoBERTa, and BART models. |
| Dataset Splits | Yes | For PTQ, equipping our framework, we use 256 samples to calibrate the model. For training, hyper-parameters like learning rate are searched both for our methods and baseline techniques for fair comparisons. Details see Appendix F. (Tables 4 and 5 also report MNLI acc m/mm, i.e. accuracy on the matched/mismatched validation sets, as is common for GLUE.) |
| Hardware Specification | No | The main paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running the experiments. While the author checklist indicates this information is provided, it is not present in the main body of the paper. |
| Software Dependencies | No | The paper mentions combining methods with 'LSQ+ [12]' and takes schemes from 'FasterTransformer [38]', but it does not specify software components with version numbers (e.g., PyTorch version, specific library versions). |
| Experiment Setup | Yes | For PTQ, equipping our framework, we use 256 samples to calibrate the model. For training, hyper-parameters like learning rate are searched both for our methods and baseline techniques for fair comparisons. Details see Appendix F. (Also, 'Here, 4-4-4 presents 4-bit weight, embedding, and activation.' and 'For the percentile, we search the hyper-parameter in [0.999, 0.9999, 0.99999] and report the best on dev set.') |
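
The percentile search quoted in the Experiment Setup row can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, the quantizer is assumed to be a plain symmetric uniform one, the "activations" are synthetic numbers, and mean squared error on a held-out set stands in for the task metric that the paper reports on the dev set. Only the candidate list [0.999, 0.9999, 0.99999] and the 4-bit setting come from the quoted text.

```python
import math
import random


def percentile_clip_value(values, percentile):
    """Nearest-rank percentile of |values|, used as a clipping threshold."""
    mags = sorted(abs(v) for v in values)
    idx = max(0, math.ceil(percentile * len(mags)) - 1)
    return mags[min(idx, len(mags) - 1)]


def fake_quantize(values, clip, bits):
    """Symmetric uniform quantize-dequantize with clipping at +/-clip (assumed scheme)."""
    if clip <= 0:
        raise ValueError("clip must be positive")
    qmax = 2 ** (bits - 1) - 1
    scale = clip / qmax
    out = []
    for v in values:
        q = round(v / scale)
        q = max(-qmax, min(qmax, q))  # saturate values beyond the clip range
        out.append(q * scale)
    return out


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def search_percentile(calib, dev, candidates, bits):
    """Return (percentile, dev_error, clip) for the candidate with the lowest dev error.

    MSE is a stand-in here; the paper selects by task performance on the dev set.
    """
    best = None
    for p in candidates:
        clip = percentile_clip_value(calib, p)
        err = mse(dev, fake_quantize(dev, clip, bits))
        if best is None or err < best[1]:
            best = (p, err, clip)
    return best


def demo():
    random.seed(0)
    # Synthetic stand-in for activations gathered from calibration samples,
    # with a few injected large-magnitude outliers.
    calib = [random.gauss(0.0, 1.0) for _ in range(256 * 64)] + [40.0, -35.0, 50.0]
    dev = [random.gauss(0.0, 1.0) for _ in range(4096)] + [45.0]
    p, err, clip = search_percentile(calib, dev, [0.999, 0.9999, 0.99999], bits=4)
    print(f"selected percentile={p}, clip={clip:.3f}, dev MSE={err:.5f}")


if __name__ == "__main__":
    demo()
```

A lower percentile clips more outliers and leaves finer resolution for typical values; a higher one preserves outliers at the cost of a coarser grid. Which candidate wins depends on the data and on the selection metric, which is why the quoted setup treats the percentile as a searched hyper-parameter.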