Evaluating Quantized Large Language Models

Authors: Shiyao Li, Xuefei Ning, Luning Wang, Tengxuan Liu, Xiangsheng Shi, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper presents a thorough evaluation of these factors by evaluating the effect of PTQ on Weight, Activation, and KV Cache on 11 model families, including OPT, LLaMA2, Falcon, Bloomz, Mistral, ChatGLM, Vicuna, LongChat, StableLM, Gemma, and Mamba, with parameters ranging from 125M to 180B. The evaluation encompasses five types of tasks: basic NLP, emergent ability, trustworthiness, dialogue, and long-context tasks. Moreover, we also evaluate the state-of-the-art (SOTA) quantization methods to demonstrate their applicability. Based on the extensive experiments, we systematically summarize the effect of quantization, provide recommendations to apply quantization techniques, and point out future directions.
Researcher Affiliation | Collaboration | (1) Department of Electronic Engineering, Tsinghua University, Beijing, China; (2) Infinigence AI; (3) Shanghai Jiaotong University, Shanghai, China.
Pseudocode | No | The paper includes mathematical formulas for quantization but no explicitly labeled "Pseudocode" or "Algorithm" blocks, nor does it present structured steps in a code-like format.
Open Source Code | Yes | The code can be found in https://github.com/thu-nics/qllm-eval.
Open Datasets | Yes | As illustrated in Table 2, we evaluate five distinct types of tasks in LLMs, including the basic NLP tasks in Sec. 3, the tasks for the emergent abilities in Sec. 4, the trustworthiness tasks in Appendix D, the dialogue tasks in Sec. 6 and the long-context processing tasks in Sec. 7. More details about datasets and evaluation workflows are in the Appendix. Appendix B.1 details specific datasets like CHID (Zheng et al., 2019), Winogrande (Sakaguchi et al., 2021), RACE (Lai et al., 2017), LAMBADA (Paperno et al., 2016), PIQA (Bisk et al., 2020), SIQA (Sap et al., 2019), which are standard and publicly available.
Dataset Splits | Yes | The CHID dataset (Zheng et al., 2019) is a Chinese idiom reading comprehension task... The dataset is split into the train/dev/test sets. We evaluate the quantized LLMs on the test split... The Winogrande dataset (Sakaguchi et al., 2021)... The whole dataset is divided into train/dev/test sets. The evaluations are based on the dev set... The MMLU dataset... In total, the dataset consists of 15,908 questions, split into the dev subset, the validation subset, and the test set.
Hardware Specification | No | The paper does not specify the hardware used to run the experiments, such as specific GPU or CPU models.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies used in the experiments.
Experiment Setup | Yes | For the group-wise KV Cache and Weight-only Quantization, we set the group size to be the hidden dimension size of one head in the model's multi-head attention block. Specifically, for the Mistral, LLaMA2, Vicuna, LongChat, and ChatGLM families, the group size is 128. For the Falcon family, the group size is 64. For the Bloomz and OPT families, different LLMs have different group sizes.
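The quantization formulas mentioned in the Pseudocode row above are not reproduced in this report. As a point of reference, the sketch below shows plain round-to-nearest uniform quantization, the basic operation that PTQ applies to weights, activations, and the KV cache. It is a minimal illustration, not the authors' implementation; the function name, symmetric scheme, bit-width default, and clamping epsilon are assumptions made here.

```python
import torch

def fake_quantize(x: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    """Simulate symmetric round-to-nearest uniform quantization.

    The tensor is scaled onto the signed integer grid for `n_bits`,
    rounded, clamped, and mapped back to floating point ("fake"
    quantization), which is how PTQ effects are commonly measured
    in evaluation harnesses.
    """
    qmax = 2 ** (n_bits - 1) - 1                      # e.g. 127 for 8-bit
    scale = x.abs().max().clamp(min=1e-8) / qmax      # one scale per tensor
    x_int = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return x_int * scale
```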
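The Experiment Setup row describes group-wise quantization with the group size set to the hidden dimension of one attention head (128 for the Mistral, LLaMA2, Vicuna, LongChat, and ChatGLM families; 64 for Falcon). A minimal sketch of that grouping, assuming contiguous groups along the last dimension and one scale per group, is given below; it extends the fake-quantization sketch above and is illustrative only, not taken from the qllm-eval codebase.

```python
import torch

def group_fake_quantize(x: torch.Tensor, n_bits: int = 4,
                        group_size: int = 128) -> torch.Tensor:
    """Fake-quantize with one scale per contiguous group of `group_size`
    elements along the last dimension (e.g. 128 = head dim in LLaMA2-style models)."""
    orig_shape = x.shape
    assert orig_shape[-1] % group_size == 0, "last dim must be divisible by group_size"
    groups = x.reshape(-1, group_size)
    qmax = 2 ** (n_bits - 1) - 1
    scale = groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(groups / scale), -qmax - 1, qmax)
    return (q * scale).reshape(orig_shape)

# Example: 4-bit weight-only quantization of a LLaMA2-style projection matrix
w = torch.randn(4096, 4096)
w_q = group_fake_quantize(w, n_bits=4, group_size=128)
```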