Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT

Authors: Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W. Mahoney, Kurt Keutzer (pp. 8815-8821)

AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We extensively test our proposed method on BERT downstream tasks of SST-2, MNLI, CoNLL-03, and SQuAD. We can achieve comparable performance to baseline with at most 2.3% performance degradation, even with ultra-low precision quantization down to 2 bits.
Researcher Affiliation | Collaboration | University of California at Berkeley, {sheng.s, zhendong, yejiayu, linjian, zheweiy, amirgh, mahoneymw, keutzer}@berkeley.edu. Equal contribution. Work done while interning at Wave Computing.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described.
Open Datasets | Yes | We extensively test our proposed method on BERT downstream tasks of SST-2, MNLI, CoNLL-03, and SQuAD. Details of the datasets are shown in Appendix. (These are well-known benchmark datasets.)
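The paper does not say how these benchmarks were obtained. As a hedged illustration only, all four are publicly distributed through the Hugging Face `datasets` library; the loader calls below are an assumption about tooling, not something the authors specify.

```python
# Hedged sketch: loading the four benchmarks named in the paper via the
# Hugging Face `datasets` library (an assumption; not mentioned by the authors).
from datasets import load_dataset

sst2 = load_dataset("glue", "sst2")    # SST-2 sentiment classification
mnli = load_dataset("glue", "mnli")    # MNLI natural language inference
conll = load_dataset("conll2003")      # CoNLL-03 named entity recognition
squad = load_dataset("squad")          # SQuAD v1.1 question answering

print({split: len(ds) for split, ds in sst2.items()})  # split names and sizes
```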
Dataset Splits | No | The paper mentions evaluating on the "development set" and using "10% of the entire training dataset" for Hessian calculation, but it does not specify the train/validation/test splits (e.g., percentages or exact counts) used for the main experiments.
Hardware Specification | No | The paper does not provide specific details of the hardware used to run its experiments; it mentions "academic computational resources" but no specific models or specifications.
Software Dependencies | No | The paper does not specify its ancillary software dependencies or their version numbers.
Experiment Setup | Yes | To set mixed precision to each encoder layer of BERT_BASE, we measure the sensitivity based on Eq. 2... We then perform quantization-aware finetuning based on the selected precision setting. All experiments in Fig. 1 are based on 10 runs and each run uses 10% of the entire training dataset. ...all the models except for Baseline are using 8-bits activation. ...we used 128 groups for both Q-BERT and Q-BERT_MP in Sec. 3.1.
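The quoted setup combines two ingredients: a Hessian-based layer sensitivity measurement and group-wise quantization with 128 groups. Since no code is released, the following is a minimal sketch assuming PyTorch; the function names, the row-wise grouping, and the symmetric uniform quantizer are illustrative assumptions, not the authors' implementation. The paper's sensitivity metric (Eq. 2) aggregates such top-eigenvalue estimates across the 10 runs mentioned above; the sketch shows only a single per-batch estimate.

```python
# Minimal sketch of the two ingredients quoted above, assuming PyTorch.
# Neither function is the authors' code; names, grouping along rows, and the
# symmetric quantizer are illustrative assumptions.
import torch


def top_hessian_eigenvalue(loss, params, n_iter=20, tol=1e-3):
    """Estimate the largest Hessian eigenvalue of `loss` w.r.t. `params`
    by power iteration on Hessian-vector products (the quantity behind
    the layer-wise sensitivity measurement)."""
    params = [p for p in params if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((x ** 2).sum() for x in v))
    v = [x / norm for x in v]
    eig = None
    for _ in range(n_iter):
        # Hessian-vector product via a second backward pass.
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        new_eig = sum((h * x).sum() for h, x in zip(hv, v)).item()  # Rayleigh quotient
        norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
        v = [h / (norm + 1e-12) for h in hv]
        if eig is not None and abs(new_eig - eig) <= tol * (abs(eig) + 1e-12):
            return new_eig
        eig = new_eig
    return eig


def groupwise_quantize(weight, n_bits=2, n_groups=128):
    """Symmetric uniform quantization with one scale per group of output
    rows, mirroring the 128-group setting quoted above."""
    assert weight.shape[0] % n_groups == 0, "sketch assumes divisibility"
    rows_per_group = weight.shape[0] // n_groups
    q_max = 2 ** (n_bits - 1) - 1
    out = torch.empty_like(weight)
    for g in range(n_groups):
        block = weight[g * rows_per_group:(g + 1) * rows_per_group]
        scale = block.abs().max() / max(q_max, 1)
        q = torch.clamp(torch.round(block / (scale + 1e-12)), -q_max - 1, q_max)
        out[g * rows_per_group:(g + 1) * rows_per_group] = q * scale
    return out
```

In this sketch, layers with larger top-eigenvalue estimates would be assigned more bits and the rest fewer, after which quantization-aware finetuning proceeds with the quantized (dequantized-in-forward) weights; the exact bit-assignment rule is the paper's, not shown here.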