SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

Authors: Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, Song Han

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate up to 1.56× speedup and 2× memory reduction for LLMs with negligible loss in accuracy. SmoothQuant enables serving a 530B LLM within a single node. Our work offers a turn-key solution that reduces hardware costs and democratizes LLMs.
Researcher Affiliation | Collaboration | 1 Massachusetts Institute of Technology, 2 NVIDIA. Correspondence to: Guangxuan Xiao <xgx@mit.edu>, Ji Lin <jilin@mit.edu>.
Pseudocode | No | The paper contains figures and equations describing the method, but no explicit pseudocode or algorithm blocks are provided.
Open Source Code | Yes | https://github.com/mit-han-lab/smoothquant
Open Datasets | Yes | We use seven zero-shot evaluation tasks: LAMBADA (Paperno et al., 2016), HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), WinoGrande (Sakaguchi et al., 2019), OpenBookQA (Mihaylov et al., 2018), RTE (Wang et al., 2018), COPA (Roemmele et al., 2011), and one language modeling dataset WikiText (Merity et al., 2016) to evaluate the OPT and BLOOM models. We use MMLU (Hendrycks et al., 2020), MNLI (Williams et al., 2018), QNLI (Wang et al., 2018) and LAMBADA to evaluate the GLM-130B model...
Dataset Splits | Yes | We get a suitable α by running a quick grid search on a subset of the Pile (Gao et al., 2020) validation set. ... We use seven zero-shot evaluation tasks: LAMBADA (Paperno et al., 2016), HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), WinoGrande (Sakaguchi et al., 2019), OpenBookQA (Mihaylov et al., 2018), RTE (Wang et al., 2018), COPA (Roemmele et al., 2011), and one language modeling dataset WikiText (Merity et al., 2016) to evaluate the OPT and BLOOM models. (A hedged sketch of this α grid search follows the table.)
Hardware Specification | Yes | All our experiments are conducted on NVIDIA A100 80GB GPU servers.
Software Dependencies | No | The paper mentions software components such as 'PyTorch Huggingface', 'FasterTransformer', and 'CUTLASS INT8 GEMM kernels', but does not specify their version numbers.
Experiment Setup | Yes | The migration strength α = 0.5 is a general sweet spot for all the OPT and BLOOM models, and α = 0.75 for GLM-130B... We calibrate the smoothing factors and the static quantization step sizes once with 512 random sentences... We measure the end-to-end latency of generating all hidden states for a batch of 4 sentences... when the sequence length is 256... we clip the top 2% tokens when calibrating the static quantization step sizes for GLM-130B... (The smoothing-factor calibration is illustrated in the sketch after the table.)
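
The α values and calibration budget quoted in the Experiment Setup row refer to SmoothQuant's per-channel smoothing factors, s_j = max(|X_j|)^α / max(|W_j|)^(1−α), which migrate quantization difficulty from activations to weights before W8A8 quantization. The sketch below is a minimal illustration of how such factors could be computed from calibration statistics and folded into the preceding LayerNorm and the linear weight so the smoothed model stays mathematically equivalent; the function names, argument layout, and epsilon handling are illustrative assumptions, not code from the released repository.

```python
import torch

@torch.no_grad()
def compute_smoothing_factors(act_abs_max, weight, alpha=0.5, eps=1e-5):
    """Per-input-channel smoothing factors s_j = max|X_j|^alpha / max|W_j|^(1-alpha).

    act_abs_max: [in_features] running max of |X| collected on calibration sentences
    weight:      [out_features, in_features] weight of the following linear layer
    alpha:       migration strength (the paper reports 0.5 for OPT/BLOOM, 0.75 for GLM-130B)
    """
    w_abs_max = weight.abs().amax(dim=0)  # max |W_j| over output channels, per input channel
    s = act_abs_max.clamp(min=eps).pow(alpha) / w_abs_max.clamp(min=eps).pow(1.0 - alpha)
    return s.clamp(min=eps)

@torch.no_grad()
def fold_smoothing(prev_layernorm, linear, s):
    """Fold diag(s)^-1 into the preceding LayerNorm and diag(s) into the linear weight,
    so Y = (X diag(s)^-1)(diag(s) W) is unchanged before quantization."""
    prev_layernorm.weight.div_(s)
    if prev_layernorm.bias is not None:
        prev_layernorm.bias.div_(s)
    linear.weight.mul_(s)  # broadcasts s over the in_features dimension
```

In this reading, act_abs_max would be accumulated once over the 512 calibration sentences mentioned above, after which the folded model can be quantized statically to INT8.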
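For the α grid search mentioned in the Dataset Splits row, a minimal sketch under stated assumptions is shown below: it evaluates a small grid of migration strengths on a held-out subset and keeps the best one. The helpers quantize_model and eval_perplexity are hypothetical stand-ins for the paper's actual smoothing-plus-W8A8 pipeline and its Pile-validation perplexity evaluation, and the candidate grid is illustrative.

```python
def search_alpha(model, calib_subset, candidate_alphas=(0.4, 0.5, 0.6, 0.7, 0.75, 0.8)):
    """Pick the migration strength with the lowest perplexity on a validation subset."""
    best_alpha, best_ppl = None, float("inf")
    for alpha in candidate_alphas:
        # Hypothetical helpers: smooth + quantize a copy of the model, then score it.
        quant_model = quantize_model(model, calib_subset, alpha=alpha)
        ppl = eval_perplexity(quant_model, calib_subset)
        if ppl < best_ppl:
            best_alpha, best_ppl = alpha, ppl
    return best_alpha
```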