Norm Tweaking: High-Performance Low-Bit Quantization of Large Language Models

Authors: Liang Li, Qingyuan Li, Bo Zhang, Xiangxiang Chu

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on various datasets using several open-sourced LLMs. Our method demonstrates significant improvements in both weight-only quantization and joint quantization of weights and activations, surpassing existing PTQ methods. On GLM-130B and OPT-66B, our method even achieves the same level of accuracy at 2-bit quantization as their float counterparts.
Researcher Affiliation | Industry | Liang Li, Qingyuan Li, Bo Zhang, Xiangxiang Chu (Meituan); emails: {liliang58,liqingyuan02,zhangbo97,chuxiangxiang}@meituan.com
Pseudocode | Yes | Algorithm 1: Norm-Tweaking (a hedged PyTorch sketch of this loop is given after the table)
  Input: Pre-trained LLM model
  Output: Quantized LLM model
  1: Generate a calibration dataset (nsamples = 128, token_length = 2048) using the pre-trained LLM model
  2: for each layer l in the Transformer structure (L layers in total) do
  3:   if l = 0 then
  4:     use the calibration data as input
  5:   else
  6:     use the last quantized output qOut_{l-1} as input
  7:   end if
  8:   Calculate the float output fOut_l
  9:   Quantize the weights of layer l
  10:  Freeze all Linear layers' weights in layer l
  11:  for each iteration it in total_Iters do
  12:    Calculate the quantized output qOut_l
  13:    Calculate L_dist between fOut_l and qOut_l
  14:    Backward and update the LayerNorm parameters
  15:  end for
  16: end for
  17: Get the high-performance quantized LLM
Open Source Code | No | The paper does not state that source code for the method is released, nor does it provide a link to a code repository.
Open Datasets | Yes | Our primary experimental evaluations are performed on the LAMBADA dataset (Paperno et al. 2016), which places high demands on natural-language understanding ability. To further substantiate the generalization of our method across datasets, we employed the Benchmark Harness (Gao et al. 2021) to conduct tests on a broader spectrum of datasets, encompassing HellaSwag (Zellers et al. 2019), PIQA (Bisk et al. 2020), WinoGrande (Sakaguchi et al. 2021), OpenBookQA (Mihaylov et al. 2018), and some datasets from the General Language Understanding Evaluation (GLUE) benchmark. We also use WikiText-2 (Merity et al. 2016), PTB (Marcus et al. 1994), and C4 (Raffel et al. 2020) in Table 5, and provide some demonstrations of text generated by quantized LLMs to more intuitively visualize the performance recovery of Norm-Tweaking.
Dataset Splits | Yes | Following the settings in GPTQ, we used a calibration dataset of nsamples = 128 samples with a maximum sequence length of token_length = 2048 (see the calibration-data sketch after the table).
Hardware Specification | Yes | All experiments were conducted on a single NVIDIA A100 GPU.
Software Dependencies | No | The paper mentions the Adam optimizer, RMSNorm normalization layers, and a deployment framework (FasterTransformer), but it does not provide version numbers for any software libraries, frameworks, or dependencies used in the experiments.
Experiment Setup | Yes | Following the settings in GPTQ, we used a calibration dataset with nsamples = 128 and a maximum sequence length of token_length = 2048. Unless otherwise noted, the Norm-Tweaking results presented in the paper are obtained with weight-only quantization based on the GPTQ algorithm. In the tweaking process, we choose the Adam optimizer (Kingma and Ba 2015) to update the LayerNorm parameters of LLMs, or the RMSNorm (Zhang and Sennrich 2019) parameters in the case of LLaMA. The learning rate needs to be set carefully, since a large learning rate damages the final results; in our experiments we typically use a grid search to obtain the optimal learning rate, with an initial value of 1e-5, and we adopt a small learning rate together with a step scheduler that assigns different learning rates to the subsequent layers (see the optimizer sketch after the table). The number of tuning iterations is very small: typically only one iteration over the calibration text is required.
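To make Algorithm 1 concrete, here is a minimal PyTorch-style sketch of the per-layer tweaking loop. It is not the authors' implementation: quantize_weights is a hypothetical stand-in for the GPTQ weight quantizer, the layer is assumed to be a plain nn.Module whose forward returns a tensor, and the paper's distribution loss L_dist is replaced by a simple MSE placeholder.

import copy
import torch

def tweak_layer_norms(layer, layer_inputs, quantize_weights, lr=1e-5, iters=1):
    # Keep a frozen float copy of the layer as the reference (fOut_l in Algorithm 1).
    float_layer = copy.deepcopy(layer)
    for p in float_layer.parameters():
        p.requires_grad_(False)

    # Quantize this layer's Linear weights (stand-in for GPTQ), then freeze them;
    # only the LayerNorm / RMSNorm parameters remain trainable.
    quantize_weights(layer)
    norm_params = []
    for name, p in layer.named_parameters():
        if "norm" in name.lower():
            p.requires_grad_(True)
            norm_params.append(p)
        else:
            p.requires_grad_(False)

    optimizer = torch.optim.Adam(norm_params, lr=lr)
    for _ in range(iters):                   # typically a single pass over the calibration data
        for x in layer_inputs:               # x: activations fed into this layer (qOut_{l-1})
            with torch.no_grad():
                f_out = float_layer(x)       # float reference output fOut_l
            q_out = layer(x)                 # quantized-layer output qOut_l
            loss = torch.nn.functional.mse_loss(q_out, f_out)  # placeholder for L_dist
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return layer

In an end-to-end pipeline, the outputs q_out of layer l would then be collected and passed as layer_inputs to layer l+1, matching steps 2 to 6 of Algorithm 1.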
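Step 1 of Algorithm 1 and the split settings above (nsamples = 128, token_length = 2048) could be realized roughly as follows. This is a hedged sketch rather than the paper's generation recipe: the checkpoint name, the BOS-token seeding, and the sampling hyper-parameters are placeholders.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"   # placeholder checkpoint, not the paper's model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda()

nsamples, token_length = 128, 2048
calibration_data = []
for _ in range(nsamples):
    # Seed generation from the BOS token and let the pre-trained model write the
    # calibration text itself (Algorithm 1, step 1).
    seed = torch.tensor([[tokenizer.bos_token_id]], device=model.device)
    sample = model.generate(seed, do_sample=True, top_k=50,
                            max_length=token_length,
                            pad_token_id=tokenizer.eos_token_id)
    calibration_data.append(sample[0])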
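Finally, a hedged sketch of the optimizer setup from the experiment-setup row: Adam over only the normalization parameters, a base learning rate of 1e-5, and a per-layer step scheduler. The scaling rule below is purely illustrative; the paper selects its learning rates by grid search and does not publish the exact schedule.

import torch

def make_norm_optimizer(layer, layer_index, num_layers, base_lr=1e-5):
    # Collect only the LayerNorm / RMSNorm parameters of this Transformer layer.
    norm_params = [p for name, p in layer.named_parameters()
                   if "norm" in name.lower()]
    # Illustrative step schedule: mildly adjust the rate for deeper layers.
    # The scaling factor is an assumption, not the schedule used in the paper.
    lr = base_lr * (1.0 + layer_index / max(num_layers, 1))
    return torch.optim.Adam(norm_params, lr=lr)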