Automatic Mixed-Precision Quantization Search of BERT
Authors: Changsheng Zhao, Ting Hua, Yilin Shen, Qian Lou, Hongxia Jin
IJCAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations on BERT downstream tasks reveal that our proposed method outperforms baselines by providing the same performance with much smaller model size. Extensive experimental validation on various NLP tasks. We evaluate the proposed AQ-BERT on four NLP tasks, including Sentiment Classification, Question Answering, Natural Language Inference, and Named Entity Recognition. |
| Researcher Affiliation | Industry | Changsheng Zhao, Ting Hua, Yilin Shen, Qian Lou, Hongxia Jin. Samsung Research America. {changsheng.z, ting.hua, yilin.shen, qian.lou, hongxia.jin}@samsung.com |
| Pseudocode | Yes | Algorithm 1: The Procedure of AQ-BERT |
| Open Source Code | No | Our implementation is based on transformers by huggingface [1]. The AdamW optimizer is set with learning rate 2e-5, and SGD is set with learning rate 0.1 for architecture optimization. (Footnote 1 points to https://github.com/huggingface/transformers; no repository for the authors' own code is provided.) |
| Open Datasets | Yes | We evaluate our proposed AQ-BERT and other baselines (BERT-base, Q-BERT, and DistilBERT-base) on four NLP tasks: SST-2, MNLI, CoNLL-2003, and SQuAD. (A hedged dataset-loading sketch follows the table.) |
| Dataset Splits | Yes | Algorithm 1 takes a training set D_train and a validation set D_val as input, and includes the step "Calculate L_val on D_val via Equation 14 to update bit assignments O". |
| Hardware Specification | No | The paper does not specify any hardware used for experiments. |
| Software Dependencies | No | Our implementation is based on transformers by huggingface [1]. The AdamW optimizer is set with learning rate 2e-5, and SGD is set with learning rate 0.1 for architecture optimization. No version numbers are given for transformers, AdamW, or SGD. |
| Experiment Setup | Yes | The AdamW optimizer is set with learning rate 2e-5, and SGD is set with learning rate 0.1 for architecture optimization. Both Q-BERT and our method use 8-bit activations. All model sizes reported here exclude the embedding layer, as the embedding is uniformly quantized to 8 bits. (A toy bi-level training sketch follows the table.) |
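
The Open Datasets row names SST-2, MNLI, CoNLL-2003, and SQuAD. The paper does not state how the data was obtained; the sketch below shows one plausible route via the Hugging Face `datasets` library. This is an assumption for illustration, not the authors' pipeline.

```python
# Hypothetical loading of the four evaluation benchmarks named in the paper.
# The use of the Hugging Face `datasets` library here is an assumption.
from datasets import load_dataset

sst2 = load_dataset("glue", "sst2")      # Sentiment Classification
mnli = load_dataset("glue", "mnli")      # Natural Language Inference
conll = load_dataset("conll2003")        # Named Entity Recognition
squad = load_dataset("squad")            # Question Answering

# Each benchmark ships with predefined splits, e.g. sst2["train"] / sst2["validation"].
for name, ds in [("SST-2", sst2), ("MNLI", mnli), ("CoNLL-2003", conll), ("SQuAD", squad)]:
    print(name, list(ds.keys()))
```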
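
The Dataset Splits and Experiment Setup rows together describe a bi-level procedure: model weights are trained with AdamW (lr 2e-5), while the bit-assignment ("architecture") parameters are updated with SGD (lr 0.1) using a validation loss L_val computed on D_val. The sketch below is a minimal, toy illustration of that procedure under those stated settings; `ToyQuantLayer`, its `bit_logits`/`bit_choices`, the loss, and the random data are hypothetical stand-ins, not the paper's AQ-BERT implementation or its Equation 14.

```python
# Toy bi-level sketch following the quoted Algorithm 1 step and reported optimizer settings.
# ToyQuantLayer is a placeholder, not the paper's mixed-precision BERT.
import torch
import torch.nn as nn
from torch.optim import AdamW, SGD


class ToyQuantLayer(nn.Module):
    """Stand-in for a quantized layer with learnable bit-assignment logits."""

    def __init__(self, dim=16):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(dim, dim) * 0.01)      # "model weights"
        self.bit_logits = nn.Parameter(torch.zeros(4))                # "bit assignments O"
        self.register_buffer("bit_choices", torch.tensor([2.0, 4.0, 8.0, 16.0]))

    def forward(self, x):
        # Softmax over candidate bit-widths mimics a differentiable mixed-precision choice.
        probs = torch.softmax(self.bit_logits, dim=0)
        scale = (probs * self.bit_choices).sum() / self.bit_choices.max()
        return x @ self.weight * scale


model = ToyQuantLayer()
weight_opt = AdamW([model.weight], lr=2e-5)    # model weights, lr 2e-5 (as reported)
arch_opt = SGD([model.bit_logits], lr=0.1)     # bit assignments, lr 0.1 (as reported)
loss_fn = nn.MSELoss()

# D_train and D_val from Algorithm 1, faked here with random tensors.
train_batches = [(torch.randn(8, 16), torch.randn(8, 16)) for _ in range(5)]
val_batches = [(torch.randn(8, 16), torch.randn(8, 16)) for _ in range(5)]

for (x_tr, y_tr), (x_val, y_val) in zip(train_batches, val_batches):
    # Calculate L_val on D_val to update the bit assignments O (cf. Equation 14 in the paper).
    arch_opt.zero_grad()
    loss_fn(model(x_val), y_val).backward()
    arch_opt.step()

    # Then update the model weights on D_train.
    weight_opt.zero_grad()
    loss_fn(model(x_tr), y_tr).backward()
    weight_opt.step()
```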