Towards Efficient Post-training Quantization of Pre-trained Language Models

Authors: Haoli Bai, Lu Hou, Lifeng Shang, Xin Jiang, Irwin King, Michael R. Lyu

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments on GLUE and SQuAD benchmarks show that our proposed PTQ solution not only performs close to QAT, but also enjoys significant reductions in training time, memory overhead, and data consumption."
Researcher Affiliation | Collaboration | "Haoli Bai1,2, Lu Hou1, Lifeng Shang1, Xin Jiang1, Irwin King2, Michael R. Lyu2. 1Huawei Noah's Ark Lab, 2The Chinese University of Hong Kong. {baihaoli,houlu3,Shang.Lifeng,Jiang.Xin}@huawei.com, {king,lyu}@cse.cuhk.edu.hk"
Pseudocode | Yes | "Algorithm 1: Efficient PTQ for PLMs. Algorithm 2: MREM algorithm."
Open Source Code | No | The paper does not provide a direct link to a source code repository or an explicit statement that the code is publicly available in the main body. The checklist mentions code availability but provides no URL or reference in the main text.
Open Datasets | Yes | "We evaluate post-training quantization on both the GLUE [45] and SQuAD benchmarks [39]."
Dataset Splits | Yes | "We use the same evaluation metrics in [12, 56] for the development set of GLUE and SQuAD benchmarks. For results in Section 4.2, we report accuracies on both the matched and mismatched sections of MNLI, and EM (exact match) and F1 score for SQuAD."
Hardware Specification | Yes | "The training time and memory in (a) and (b) are measured by 4-bit weights and 8-bit activations (i.e., W4A8) on an NVIDIA V100 GPU. ... By default, we partition the model into 4 modules on 4 NVIDIA V100 GPUs."
Software Dependencies | No | "Our implementation is based on MindSpore [1]." The version number of MindSpore or any other software dependency is not specified.
Experiment Setup | Yes | "For each module, we train for 2,000 steps with an initial learning rate of 1e-4 on GLUE tasks, and 4,000 steps with an initial learning rate of 5e-5 on SQuAD datasets. The learning rate decays linearly as done in [24, 56]. By default, we partition the model into 4 modules on 4 NVIDIA V100 GPUs."
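As a lightweight reference for the W4A8 setting and the linearly decaying learning rate quoted in the table, the sketch below illustrates min-max uniform quantization at the stated bit-widths (4-bit weights, 8-bit activations) and the per-module schedule (2,000 steps at 1e-4 on GLUE, 4,000 steps at 5e-5 on SQuAD). This is a minimal, hypothetical Python/NumPy sketch under those assumptions, not the paper's MindSpore implementation; the function names are illustrative only, and the module-parallel training across 4 V100 GPUs described above is not modeled here.

```python
import numpy as np

def fake_quantize(x, num_bits):
    """Generic min-max uniform (asymmetric) fake quantization.

    Illustrates the bit-widths of the W4A8 setting quoted in the table
    (num_bits=4 for weights, num_bits=8 for activations). This is an
    assumption-level sketch, not the paper's quantizer.
    """
    qmin, qmax = 0, (1 << num_bits) - 1
    span = float(x.max() - x.min())
    scale = span / (qmax - qmin) if span > 0 else 1.0
    zero_point = qmin - x.min() / scale
    q = np.clip(np.round(x / scale + zero_point), qmin, qmax)
    return (q - zero_point) * scale  # dequantized (simulated-quantization) values

def linear_decay_lr(step, total_steps, init_lr):
    """Linearly decaying learning rate, e.g. init_lr=1e-4 over 2,000 steps
    per module on GLUE, or init_lr=5e-5 over 4,000 steps on SQuAD."""
    return init_lr * max(0.0, 1.0 - step / float(total_steps))

# Example: quantize random weights to 4 bits and inspect the schedule.
w = np.random.randn(768, 768).astype(np.float32)
w_q = fake_quantize(w, num_bits=4)
print("max quantization error:", np.abs(w - w_q).max())
print([round(linear_decay_lr(s, 2000, 1e-4), 6) for s in (0, 1000, 2000)])
```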