Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT

Authors: Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W. Mahoney, Kurt Keutzer (pp. 8815-8821)

AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We extensively test our proposed method on BERT downstream tasks of SST-2, MNLI, CoNLL-03, and SQuAD. We can achieve comparable performance to baseline with at most 2.3% performance degradation, even with ultra-low precision quantization down to 2 bits.
Researcher Affiliation | Collaboration | University of California at Berkeley, {sheng.s, zhendong, yejiayu, linjian, zheweiy, amirgh, mahoneymw, keutzer}@berkeley.edu. Equal contribution. Work done while interning at Wave Computing.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described.
Open Datasets | Yes | We extensively test our proposed method on BERT downstream tasks of SST-2, MNLI, CoNLL-03, and SQuAD. Details of the datasets are shown in Appendix. (These are well-known benchmark datasets.)
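The paper does not say how these benchmarks were obtained. As a hedged illustration only, all four are publicly distributed through the Hugging Face `datasets` library; the loader calls below are an assumption about tooling, not something the authors specify.

```python
# Hedged sketch: loading the four benchmarks named in the paper via the
# Hugging Face `datasets` library (an assumption; not mentioned by the authors).
from datasets import load_dataset

sst2 = load_dataset("glue", "sst2")    # SST-2 sentiment classification
mnli = load_dataset("glue", "mnli")    # MNLI natural language inference
conll = load_dataset("conll2003")      # CoNLL-03 named entity recognition
squad = load_dataset("squad")          # SQuAD v1.1 question answering

print({split: len(ds) for split, ds in sst2.items()})  # split names and sizes
```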
Dataset Splits | No | The paper mentions evaluating on the "development set" and using "10% of the entire training dataset" for Hessian calculation, but it does not specify the train/validation/test splits (e.g., percentages or exact counts) used for the main experiments.
Hardware Specification | No | The paper does not provide specific details of the hardware used to run its experiments; it mentions "academic computational resources" but no specific models or specifications.
Software Dependencies | No | The paper does not specify its ancillary software dependencies or their version numbers.
Experiment Setup | Yes | To set mixed precision to each encoder layer of BERT_BASE, we measure the sensitivity based on Eq. 2... We then perform quantization-aware finetuning based on the selected precision setting. All experiments in Fig. 1 are based on 10 runs and each run uses 10% of the entire training dataset. ...all the models except for Baseline are using 8-bits activation. ...we used 128 groups for both Q-BERT and Q-BERT_MP in Sec. 3.1.
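The quoted setup combines two ingredients: a Hessian-based layer sensitivity measurement and group-wise quantization with 128 groups. Since no code is released, the following is a minimal sketch assuming PyTorch; the function names, the row-wise grouping, and the symmetric uniform quantizer are illustrative assumptions, not the authors' implementation. The paper's sensitivity metric (Eq. 2) aggregates such top-eigenvalue estimates across the 10 runs mentioned above; the sketch shows only a single per-batch estimate.

```python
# Minimal sketch of the two ingredients quoted above, assuming PyTorch.
# Neither function is the authors' code; names, grouping along rows, and the
# symmetric quantizer are illustrative assumptions.
import torch


def top_hessian_eigenvalue(loss, params, n_iter=20, tol=1e-3):
    """Estimate the largest Hessian eigenvalue of `loss` w.r.t. `params`
    by power iteration on Hessian-vector products (the quantity behind
    the layer-wise sensitivity measurement)."""
    params = [p for p in params if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((x ** 2).sum() for x in v))
    v = [x / norm for x in v]
    eig = None
    for _ in range(n_iter):
        # Hessian-vector product via a second backward pass.
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        new_eig = sum((h * x).sum() for h, x in zip(hv, v)).item()  # Rayleigh quotient
        norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
        v = [h / (norm + 1e-12) for h in hv]
        if eig is not None and abs(new_eig - eig) <= tol * (abs(eig) + 1e-12):
            return new_eig
        eig = new_eig
    return eig


def groupwise_quantize(weight, n_bits=2, n_groups=128):
    """Symmetric uniform quantization with one scale per group of output
    rows, mirroring the 128-group setting quoted above."""
    assert weight.shape[0] % n_groups == 0, "sketch assumes divisibility"
    rows_per_group = weight.shape[0] // n_groups
    q_max = 2 ** (n_bits - 1) - 1
    out = torch.empty_like(weight)
    for g in range(n_groups):
        block = weight[g * rows_per_group:(g + 1) * rows_per_group]
        scale = block.abs().max() / max(q_max, 1)
        q = torch.clamp(torch.round(block / (scale + 1e-12)), -q_max - 1, q_max)
        out[g * rows_per_group:(g + 1) * rows_per_group] = q * scale
    return out
```

In this sketch, layers with larger top-eigenvalue estimates would be assigned more bits and the rest fewer, after which quantization-aware finetuning proceeds with the quantized (dequantized-in-forward) weights; the exact bit-assignment rule is the paper's, not shown here.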