Automatic Mixed-Precision Quantization Search of BERT

Authors: Changsheng Zhao, Ting Hua, Yilin Shen, Qian Lou, Hongxia Jin

IJCAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluations on BERT downstream tasks reveal that our proposed method outperforms baselines by providing the same performance with much smaller model size. Extensive experimental validation on various NLP tasks. We evaluate the proposed AQ-BERT on four NLP tasks, including Sentiment Classification, Question Answering, Natural Language Inference, and Named Entity Recognition.
Researcher Affiliation | Industry | Changsheng Zhao, Ting Hua, Yilin Shen, Qian Lou, Hongxia Jin; Samsung Research America; {changsheng.z, ting.hua, yilin.shen, qian.lou, hongxia.jin}@samsung.com
Pseudocode | Yes | Algorithm 1: The Procedure of AQ-BERT (a sketch of this search loop is given after the table).
Open Source Code | No | Our implementation is based on transformers by huggingface [1]. The AdamW optimizer is set with learning rate 2e-5, and SGD is set with learning rate 0.1 for architecture optimization. (Footnote 1 points to https://github.com/huggingface/transformers.)
Open Datasets | Yes | We evaluate our proposed AQ-BERT and other baselines (bert-base, Q-BERT, and DistilBERT-base) on four NLP tasks: SST-2, MNLI, CoNLL-2003, and SQuAD. (A dataset-loading sketch is given after the table.)
Dataset Splits | Yes | Input: training set D_train and validation set D_val (in Algorithm 1); Calculate L_val on D_val via Equation 14 to update bit assignments O (in Algorithm 1).
Hardware Specification | No | The paper does not specify any hardware used for experiments.
Software Dependencies | No | Our implementation is based on transformers by huggingface [1]. The AdamW optimizer is set with learning rate 2e-5, and SGD is set with learning rate 0.1 for architecture optimization. No version numbers for "transformers", "AdamW", or "SGD" are given.
Experiment Setup | Yes | The AdamW optimizer is set with learning rate 2e-5, and SGD is set with learning rate 0.1 for architecture optimization. Both Q-BERT and our method use 8-bit activations. All model sizes reported here exclude the embedding layer, as we uniformly quantized the embedding to 8 bits. (A quantization sketch is given after the table.)
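
Since no code was released, the following is a minimal sketch of how the alternating procedure quoted in the Pseudocode and Dataset Splits rows could be wired up in PyTorch. Only the optimizer choices and learning rates (AdamW at 2e-5 for the weights, SGD at 0.1 for architecture optimization) and the D_train / D_val roles come from the paper; the model handle, the "bit_" parameter naming, and the data loaders are hypothetical.

```python
from torch.optim import AdamW, SGD


def aq_bert_search(model, train_loader, val_loader, num_epochs):
    """Alternating bi-level loop sketched from the Algorithm 1 row above.

    `model` is assumed to be a quantization-aware BERT whose relaxed
    per-layer bit assignments live in parameters named 'bit_*'
    (a hypothetical convention, not taken from the paper).
    """
    weights = [p for n, p in model.named_parameters() if not n.startswith("bit_")]
    bits = [p for n, p in model.named_parameters() if n.startswith("bit_")]

    weight_opt = AdamW(weights, lr=2e-5)  # paper: AdamW, learning rate 2e-5
    arch_opt = SGD(bits, lr=0.1)          # paper: SGD, learning rate 0.1

    for _ in range(num_epochs):
        for train_batch, val_batch in zip(train_loader, val_loader):
            # Step 1: update network weights on the training split D_train.
            weight_opt.zero_grad()
            model(**train_batch).loss.backward()
            weight_opt.step()

            # Step 2: compute L_val on the validation split D_val (the role
            # Equation 14 plays in Algorithm 1) and update bit assignments.
            arch_opt.zero_grad()
            model(**val_batch).loss.backward()
            arch_opt.step()
    return model
```

Updating the bit assignments on a held-out split rather than on D_train mirrors differentiable architecture search, which is presumably why Algorithm 1 takes both splits as input.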
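
The Experiment Setup row fixes activations and embeddings at 8 bits while per-layer weight bit-widths are assigned by the search. The paper's exact quantizer is not quoted here, so the sketch below uses a generic symmetric uniform fake quantizer purely for illustration.

```python
import torch


def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Symmetric uniform fake quantization (an assumed scheme, for illustration).

    The tensor is rounded onto a 2**num_bits-level signed grid and mapped back
    to float, so it keeps its dtype but carries the quantization error, as in
    quantization-aware training.
    """
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.detach().abs().max().clamp(min=1e-8) / qmax
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale


# Per the setup quoted above: activations and embeddings are held at 8 bits,
# while the per-layer weight bit-widths are whatever the search assigns
# (the 4 bits below is only a placeholder for a searched value).
activations = fake_quantize(torch.randn(4, 128, 768), num_bits=8)
embedding_table = fake_quantize(torch.randn(30522, 768), num_bits=8)
encoder_weight = fake_quantize(torch.randn(768, 768), num_bits=4)
```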
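
The paper does not state how the four benchmarks in the Open Datasets row were obtained; one convenient way to fetch them, assuming the Hugging Face `datasets` library (an assumption consistent with the transformers-based implementation, not something the paper specifies), is:

```python
from datasets import load_dataset

# SST-2 and MNLI ship as GLUE configurations; CoNLL-2003 and SQuAD v1.1
# are available as standalone datasets on the Hugging Face Hub.
benchmarks = {
    "SST-2": load_dataset("glue", "sst2"),      # sentiment classification
    "MNLI": load_dataset("glue", "mnli"),       # natural language inference
    "CoNLL-2003": load_dataset("conll2003"),    # named entity recognition
    "SQuAD": load_dataset("squad"),             # question answering (SQuAD v1.1)
}

for name, ds in benchmarks.items():
    print(name, {split: len(ds[split]) for split in ds})
```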