SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

Authors: Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, Song Han

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate up to 1.56× speedup and 2× memory reduction for LLMs with negligible loss in accuracy. SmoothQuant enables serving a 530B LLM within a single node. Our work offers a turn-key solution that reduces hardware costs and democratizes LLMs.
Researcher Affiliation | Collaboration | 1 Massachusetts Institute of Technology, 2 NVIDIA. Correspondence to: Guangxuan Xiao <xgx@mit.edu>, Ji Lin <jilin@mit.edu>.
Pseudocode | No | The paper contains figures and equations describing the method, but no explicit pseudocode or algorithm blocks are provided.
Open Source Code | Yes | https://github.com/mit-han-lab/smoothquant
Open Datasets | Yes | We use seven zero-shot evaluation tasks: LAMBADA (Paperno et al., 2016), HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), WinoGrande (Sakaguchi et al., 2019), OpenBookQA (Mihaylov et al., 2018), RTE (Wang et al., 2018), COPA (Roemmele et al., 2011), and one language modeling dataset WikiText (Merity et al., 2016) to evaluate the OPT and BLOOM models. We use MMLU (Hendrycks et al., 2020), MNLI (Williams et al., 2018), QNLI (Wang et al., 2018) and LAMBADA to evaluate the GLM-130B model...
Dataset Splits | Yes | We get a suitable α by running a quick grid search on a subset of the Pile (Gao et al., 2020) validation set. ... We use seven zero-shot evaluation tasks: LAMBADA (Paperno et al., 2016), HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), WinoGrande (Sakaguchi et al., 2019), OpenBookQA (Mihaylov et al., 2018), RTE (Wang et al., 2018), COPA (Roemmele et al., 2011), and one language modeling dataset WikiText (Merity et al., 2016) to evaluate the OPT and BLOOM models. (A hedged sketch of this α grid search follows the table.)
Hardware Specification | Yes | All our experiments are conducted on NVIDIA A100 80GB GPU servers.
Software Dependencies | No | The paper mentions software components such as 'PyTorch Huggingface', 'FasterTransformer', and 'CUTLASS INT8 GEMM kernels', but does not specify their version numbers.
Experiment Setup | Yes | The migration strength α = 0.5 is a general sweet spot for all the OPT and BLOOM models, and α = 0.75 for GLM-130B... We calibrate the smoothing factors and the static quantization step sizes once with 512 random sentences... We measure the end-to-end latency of generating all hidden states for a batch of 4 sentences... when the sequence length is 256... we clip the top 2% tokens when calibrating the static quantization step sizes for GLM-130B... (The smoothing-factor calibration is illustrated in the sketch after the table.)
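
The α values and calibration budget quoted in the Experiment Setup row refer to SmoothQuant's per-channel smoothing factors, s_j = max(|X_j|)^α / max(|W_j|)^(1−α), which migrate quantization difficulty from activations to weights before W8A8 quantization. The sketch below is a minimal illustration of how such factors could be computed from calibration statistics and folded into the preceding LayerNorm and the linear weight so the smoothed model stays mathematically equivalent; the function names, argument layout, and epsilon handling are illustrative assumptions, not code from the released repository.

```python
import torch

@torch.no_grad()
def compute_smoothing_factors(act_abs_max, weight, alpha=0.5, eps=1e-5):
    """Per-input-channel smoothing factors s_j = max|X_j|^alpha / max|W_j|^(1-alpha).

    act_abs_max: [in_features] running max of |X| collected on calibration sentences
    weight:      [out_features, in_features] weight of the following linear layer
    alpha:       migration strength (the paper reports 0.5 for OPT/BLOOM, 0.75 for GLM-130B)
    """
    w_abs_max = weight.abs().amax(dim=0)  # max |W_j| over output channels, per input channel
    s = act_abs_max.clamp(min=eps).pow(alpha) / w_abs_max.clamp(min=eps).pow(1.0 - alpha)
    return s.clamp(min=eps)

@torch.no_grad()
def fold_smoothing(prev_layernorm, linear, s):
    """Fold diag(s)^-1 into the preceding LayerNorm and diag(s) into the linear weight,
    so Y = (X diag(s)^-1)(diag(s) W) is unchanged before quantization."""
    prev_layernorm.weight.div_(s)
    if prev_layernorm.bias is not None:
        prev_layernorm.bias.div_(s)
    linear.weight.mul_(s)  # broadcasts s over the in_features dimension
```

In this reading, act_abs_max would be accumulated once over the 512 calibration sentences mentioned above, after which the folded model can be quantized statically to INT8.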
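For the α grid search mentioned in the Dataset Splits row, a minimal sketch under stated assumptions is shown below: it evaluates a small grid of migration strengths on a held-out subset and keeps the best one. The helpers quantize_model and eval_perplexity are hypothetical stand-ins for the paper's actual smoothing-plus-W8A8 pipeline and its Pile-validation perplexity evaluation, and the candidate grid is illustrative.

```python
def search_alpha(model, calib_subset, candidate_alphas=(0.4, 0.5, 0.6, 0.7, 0.75, 0.8)):
    """Pick the migration strength with the lowest perplexity on a validation subset."""
    best_alpha, best_ppl = None, float("inf")
    for alpha in candidate_alphas:
        # Hypothetical helpers: smooth + quantize a copy of the model, then score it.
        quant_model = quantize_model(model, calib_subset, alpha=alpha)
        ppl = eval_perplexity(quant_model, calib_subset)
        if ppl < best_ppl:
            best_alpha, best_ppl = alpha, ppl
    return best_alpha
```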