Towards Efficient Post-training Quantization of Pre-trained Language Models
Authors: Haoli Bai, Lu Hou, Lifeng Shang, Xin Jiang, Irwin King, Michael R. Lyu
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on GLUE and SQuAD benchmarks show that our proposed PTQ solution not only performs close to QAT, but also enjoys significant reductions in training time, memory overhead, and data consumption. |
| Researcher Affiliation | Collaboration | Haoli Bai1,2, Lu Hou1, Lifeng Shang1, Xin Jiang1, Irwin King2, Michael R. Lyu2; 1Huawei Noah's Ark Lab, 2The Chinese University of Hong Kong. {baihaoli,houlu3,Shang.Lifeng,Jiang.Xin}@huawei.com, {king,lyu}@cse.cuhk.edu.hk |
| Pseudocode | Yes | The paper provides pseudocode: Algorithm 1 (Efficient PTQ for PLMs) and Algorithm 2 (MREM algorithm). |
| Open Source Code | No | The paper does not provide a direct link to a source code repository or an explicit statement about the code being made publicly available in the main body. The checklist mentions code availability but does not provide a URL or reference in the main text. |
| Open Datasets | Yes | We evaluate post-training quantization on both the GLUE [45] and SQuAD [39] benchmarks. |
| Dataset Splits | Yes | We use the same evaluation metrics in [12, 56] for the development sets of the GLUE and SQuAD benchmarks. For results in Section 4.2, we report accuracies on both the matched and mismatched sections of MNLI, and EM (exact match) and F1 score for SQuAD. |
| Hardware Specification | Yes | The training time and memory in (a) and (b) are measured by 4-bit weights and 8-bit activations (i.e., W4A8) on an NVIDIA V100 GPU. ...By default, we partition the model into 4 modules on 4 NVIDIA V100 GPUs. |
| Software Dependencies | No | Our implementation is based on MindSpore [1]. The version number for MindSpore or any other software dependencies is not specified. |
| Experiment Setup | Yes | For each module, we train for 2,000 steps with an initial learning rate of 1e-4 on GLUE tasks, and 4,000 steps with an initial learning rate of 5e-5 on SQuAD datasets. The learning rate decays linearly as done in [24, 56]. By default, we partition the model into 4 modules on 4 NVIDIA V100 GPUs. |
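
The Pseudocode and Experiment Setup rows describe a module-wise reconstruction error minimization (MREM) procedure: the pre-trained language model is partitioned into 4 modules, and each quantized module is trained to match its full-precision counterpart's outputs for 2,000 steps with an initial learning rate of 1e-4 (GLUE) and linear decay. The sketch below illustrates that setup. It is not the authors' code: the paper's implementation is in MindSpore, whereas this uses PyTorch, and the toy `make_module` blocks, the `FakeQuant` helper, the random calibration batches, and the sequential (rather than 4-GPU parallel) training loop are all illustrative assumptions; activation quantization (the A8 part of W4A8) is omitted.

```python
# Hedged sketch of module-wise reconstruction error minimization (MREM).
# Hyperparameters follow the Experiment Setup row; everything else is a
# simplified stand-in, not the paper's MindSpore implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_MODULES = 4   # the paper partitions the PLM into 4 modules (one per GPU)
STEPS = 2000      # 2,000 training steps per module on GLUE tasks
INIT_LR = 1e-4    # initial learning rate on GLUE, decayed linearly

class FakeQuant(torch.autograd.Function):
    """Uniform fake quantization with a straight-through gradient estimator."""
    @staticmethod
    def forward(ctx, w, bits=4):
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max().clamp(min=1e-8) / qmax
        return (w / scale).round().clamp(-qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None  # pass gradients straight through

class QuantLinear(nn.Linear):
    """Linear layer whose weights are fake-quantized (W4) in the forward pass."""
    def forward(self, x):
        return F.linear(x, FakeQuant.apply(self.weight), self.bias)

def make_module(hidden=768, quantized=False):
    """Toy stand-in for a group of consecutive transformer layers."""
    linear = QuantLinear if quantized else nn.Linear
    return nn.Sequential(linear(hidden, hidden), nn.GELU(), linear(hidden, hidden))

def train_module(q_mod, fp_mod, calib_batches):
    """Minimize the reconstruction error between quantized and FP module outputs."""
    opt = torch.optim.Adam(q_mod.parameters(), lr=INIT_LR)
    sched = torch.optim.lr_scheduler.LinearLR(
        opt, start_factor=1.0, end_factor=0.0, total_iters=STEPS)
    for step in range(STEPS):
        x = calib_batches[step % len(calib_batches)]
        with torch.no_grad():
            target = fp_mod(x)               # full-precision module output
        loss = F.mse_loss(q_mod(x), target)  # module-wise reconstruction error
        opt.zero_grad()
        loss.backward()
        opt.step()
        sched.step()

# Full-precision modules and quantized copies initialized from the same weights.
fp_modules = [make_module() for _ in range(NUM_MODULES)]
q_modules = []
for fp in fp_modules:
    q = make_module(quantized=True)
    q.load_state_dict(fp.state_dict())
    q_modules.append(q)

# The paper trains the 4 modules in parallel on 4 V100 GPUs and feeds each
# module the hidden states produced by the preceding layers; this sketch
# trains them sequentially on a shared random calibration set for brevity.
calib = [torch.randn(8, 128, 768) for _ in range(32)]
for q, fp in zip(q_modules, fp_modules):
    train_module(q, fp, calib)
```

Because each module is optimized against locally cached full-precision outputs rather than an end-to-end task loss, this setup avoids backpropagating through the whole network, which is consistent with the paper's reported savings in training time, memory, and data relative to QAT.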