SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
Authors: Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, Song Han
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate up to 1.56× speedup and 2× memory reduction for LLMs with negligible loss in accuracy. SmoothQuant enables serving a 530B LLM within a single node. Our work offers a turn-key solution that reduces hardware costs and democratizes LLMs. |
| Researcher Affiliation | Collaboration | 1Massachusetts Institute of Technology 2NVIDIA. Correspondence to: Guangxuan Xiao <xgx@mit.edu>, Ji Lin <jilin@mit.edu>. |
| Pseudocode | No | The paper contains figures and equations describing the method, but no explicit pseudocode or algorithm blocks are provided. |
| Open Source Code | Yes | https://github.com/mit-han-lab/smoothquant |
| Open Datasets | Yes | We use seven zero-shot evaluation tasks: LAMBADA (Paperno et al., 2016), HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), WinoGrande (Sakaguchi et al., 2019), OpenBookQA (Mihaylov et al., 2018), RTE (Wang et al., 2018), COPA (Roemmele et al., 2011), and one language modeling dataset WikiText (Merity et al., 2016) to evaluate the OPT and BLOOM models. We use MMLU (Hendrycks et al., 2020), MNLI (Williams et al., 2018), QNLI (Wang et al., 2018) and LAMBADA to evaluate the GLM-130B model... |
| Dataset Splits | Yes | We get a suitable α by running a quick grid search on a subset of the Pile (Gao et al., 2020) validation set. ... We use seven zero-shot evaluation tasks: LAMBADA (Paperno et al., 2016), HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), WinoGrande (Sakaguchi et al., 2019), OpenBookQA (Mihaylov et al., 2018), RTE (Wang et al., 2018), COPA (Roemmele et al., 2011), and one language modeling dataset WikiText (Merity et al., 2016) to evaluate the OPT and BLOOM models. |
| Hardware Specification | Yes | All our experiments are conducted on NVIDIA A100 80GB GPU servers. |
| Software Dependencies | No | The paper mentions software components like 'PyTorch (Huggingface)', 'FasterTransformer', and 'CUTLASS INT8 GEMM kernels', but does not specify their version numbers. |
| Experiment Setup | Yes | The migration strength α = 0.5 is a general sweet spot for all the OPT and BLOOM models, and α = 0.75 for GLM-130B... We calibrate the smoothing factors and the static quantization step sizes once with 512 random sentences... We measure the end-to-end latency of generating all hidden states for a batch of 4 sentences... when the sequence length is 256... we clip the top 2% tokens when calibrating the static quantization step sizes for GLM-130B... |
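
The Experiment Setup row above quotes the paper's use of a migration strength α (0.5 for OPT/BLOOM, 0.75 for GLM-130B) that balances quantization difficulty between activations and weights. The sketch below illustrates how such per-channel smoothing factors could be computed and folded into a linear layer's weights; it is a minimal illustration based on the formula s_j = max|X_j|^α / max|W_j|^(1−α), not the authors' implementation (see the official repository linked in the Open Source Code row). The function names and the `eps` floor are assumptions for this sketch.

```python
# Minimal sketch of SmoothQuant-style smoothing (illustrative only;
# the official code is at https://github.com/mit-han-lab/smoothquant).
# Assumes per-input-channel absolute maxima of activations have already been
# collected on a small calibration set (the paper uses 512 random sentences).

import torch


def compute_smoothing_factors(act_absmax: torch.Tensor,
                              weight_absmax: torch.Tensor,
                              alpha: float = 0.5,
                              eps: float = 1e-5) -> torch.Tensor:
    """Per-input-channel scales s_j = max|X_j|^alpha / max|W_j|^(1 - alpha).

    alpha is the migration strength: 0.5 for OPT/BLOOM and 0.75 for GLM-130B
    according to the paper.
    """
    scales = act_absmax.clamp(min=eps).pow(alpha) / \
             weight_absmax.clamp(min=eps).pow(1.0 - alpha)
    return scales.clamp(min=eps)


def smooth_linear(act_absmax: torch.Tensor,
                  weight: torch.Tensor,
                  alpha: float = 0.5):
    """Fold the smoothing into the weight: X' = X / s, W' = W * s.

    weight has shape (out_features, in_features); the scale is applied per
    input channel so that (X / s) @ (s * W)^T equals X @ W^T mathematically,
    while the smoothed activations become easier to quantize.
    """
    s = compute_smoothing_factors(act_absmax, weight.abs().amax(dim=0), alpha)
    smoothed_weight = weight * s  # broadcast over the in_features dimension
    return s, smoothed_weight
```

The division by s can be fused into the previous layer (e.g. a LayerNorm) at export time, so the smoothing adds no runtime overhead; that fusion is omitted here for brevity.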
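The same row also notes that static quantization step sizes are calibrated once on 512 random sentences, with the top 2% of tokens clipped for GLM-130B. The sketch below shows one hedged way to derive a symmetric per-tensor INT8 step size from calibration activations; clipping the largest 2% of absolute values is a simplification of the paper's top-2% token clipping, and the helper names are illustrative rather than the authors' API.

```python
# Minimal sketch of static per-tensor activation calibration under the
# assumptions stated above (not the paper's exact procedure).

import torch


def calibrate_static_scale(activations: torch.Tensor,
                           clip_ratio: float = 0.0,
                           n_bits: int = 8) -> float:
    """Return a symmetric step size from (already smoothed) calibration activations.

    clip_ratio=0.02 drops the largest 2% of absolute values before taking the
    max, loosely mirroring the top-2% clipping reported for GLM-130B;
    clip_ratio=0.0 uses the plain absolute maximum.
    """
    flat = activations.abs().flatten().float()
    if clip_ratio > 0.0:
        k = max(1, int(flat.numel() * (1.0 - clip_ratio)))
        clip_value = flat.kthvalue(k).values
    else:
        clip_value = flat.max()
    qmax = 2 ** (n_bits - 1) - 1  # 127 for INT8
    return (clip_value / qmax).item()


def quantize_static(x: torch.Tensor, scale: float, n_bits: int = 8) -> torch.Tensor:
    """Fake-quantize x with a precomputed (static) step size."""
    qmax = 2 ** (n_bits - 1) - 1
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q * scale
```

Because the step sizes are fixed ahead of time, no activation statistics need to be computed during inference, which is what allows the reported latency gains with the INT8 GEMM kernels.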