DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs

Authors: Haokun Lin, Haobo Xu, Yichen Wu, Jingzhi Cui, Yingtao Zhang, Linzhan Mou, Linqi Song, Zhenan Sun, Ying Wei

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluations demonstrate that our DuQuant approach significantly outperforms existing 4-bit weight-activation quantization baselines across various benchmarks. Notably, DuQuant achieves a 5% improvement in Commonsense QA tasks across all LLaMA model sizes and a 10% increase in zero-shot MMLU benchmarks for Vicuna-v1.5-13B. Moreover, in practical applications with the LLaMA2-7B model, DuQuant not only accelerates the pre-filling phase by up to 2.08× but also reduces memory usage during the decoding phase by 3.50×, with minimal impact on performance: only a 0.61 increase in perplexity and a 2.71% drop in accuracy compared to the FP16 model.
Researcher Affiliation | Academia | Haokun Lin 1,3,4, Haobo Xu 2, Yichen Wu 4, Jingzhi Cui 2, Yingtao Zhang 2, Linzhan Mou 5, Linqi Song 4, Zhenan Sun 1,3, Ying Wei 4,5; 1 School of Artificial Intelligence, University of Chinese Academy of Sciences; 2 Tsinghua University; 3 NLPR & MAIS, Institute of Automation, CAS; 4 City University of Hong Kong; 5 Zhejiang University
Pseudocode | Yes | Appendix G (Algorithm for Rotation Matrix) and Algorithm 1 (Construction of the Rotation Matrix). A hedged construction sketch follows the table.
Open Source Code | Yes | Our code is available at https://github.com/Hsu1023/DuQuant.
Open Datasets | Yes | We evaluate quantized pre-trained LLMs on language generation tasks and commonsense QA tasks. Specifically, we assess the perplexity on WikiText2 [40] and C4 [44] datasets, as well as the zero-shot accuracy on PIQA [6], ARC [12], BoolQ [11], HellaSwag [70], and WinoGrande [45] datasets. Moreover, we evaluate quantized Vicuna models on MMLU [21] and MT-Bench [76] benchmarks, as well as their long-form generative capabilities on LongBench [4]. (A perplexity-evaluation sketch follows the table.)
Dataset Splits | Yes | For calibration data, following [47, 39, 34], we randomly select 128 sampled sequences from the WikiText2 dataset, with a sequence length of 2048. (A calibration-sampling sketch follows the table.)
Hardware Specification | Yes | In this work, all experiments are done on NVIDIA RTX 3090 GPUs for small-scale models and NVIDIA A100 GPUs for large-scale models.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., PyTorch 1.x, CUDA 11.x) were explicitly mentioned in the paper.
Experiment Setup | Yes | In line with prior studies [34, 47, 39], we apply per-token activation quantization and per-channel weight quantization. Given that W8A8 quantization has been established as lossless in precision by SmoothQuant [64], our primary evaluation in this paper focuses on 4-bit and 6-bit quantization for weights and activations. As for details, we quantize all intermediate activations, excluding the SoftMax output. Moreover, we have developed two types of quantized models, denoted as DuQuant and DuQuant+LWC. For DuQuant, we employ round-to-nearest quantization, using a clipping ratio of 0.9 for activations and 0.8 for weights. To improve weight matrix quantization, DuQuant+LWC integrates the learnable weight clipping (LWC) technique from OmniQuant. Concretely, LWC adjusts weights by training parameters γ, β ∈ [0, 1] to compute the step size Δ = (γ·max(X) − β·min(X)) / (2^b − 1) in Eqn. (1). Notably, the smoothing diagonal matrix and the learned weight clipping factor can be integrated into the quantized weights, introducing no additional computational or memory costs. More details and hyperparameters are left in Appendix C. Further details from Appendix C: For calibration data, following [47, 39, 34], we randomly select 128 sampled sequences from the WikiText2 dataset, with a sequence length of 2048. For rotation and permutation transformations, the rotation block size 2^n is set to 128, and the maximum number of greedy search steps N equals 256. We apply the permutation transformation once for efficiency. For the smoothing parameter α, we set it to 0.6 for DuQuant and 0.5 for DuQuant+LWC. We clip the maximum activation values in all projection blocks, and the clipping ratio is set to 0.9. For DuQuant we also clip the maximum values in weight matrices, with a clipping ratio of 0.8. (A quantization-setup sketch follows the table.)
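
To make the Pseudocode row concrete, here is a minimal Python sketch of a greedy block-rotation construction in the spirit of Appendix G / Algorithm 1. It is not the authors' implementation: the function name, the default arguments, and the use of a generic random orthogonal mixer (QR of a Gaussian matrix) at each greedy step are assumptions; the paper's per-step rotation has its own specific structure.

```python
import torch

def greedy_block_rotation(calib_acts, block_size=128, n_steps=256, seed=0):
    """Greedily build a block-diagonal orthogonal matrix that spreads
    activation outliers within each block of channels (illustrative only).

    calib_acts: (tokens, channels) calibration activations,
    with channels divisible by block_size.
    """
    gen = torch.Generator().manual_seed(seed)
    n_channels = calib_acts.shape[1]
    rotation = torch.eye(n_channels)
    for start in range(0, n_channels, block_size):
        block = slice(start, start + block_size)
        x_blk = calib_acts[:, block].clone()
        r_blk = torch.eye(block_size)
        for _ in range(n_steps):
            # channel holding the current largest outlier in this block
            col = int(x_blk.abs().amax(dim=0).argmax())
            # swap that channel to position 0 ...
            perm = torch.eye(block_size)
            perm[[0, col]] = perm[[col, 0]]
            # ... then mix it across the block with a random orthogonal matrix
            q, _ = torch.linalg.qr(torch.randn(block_size, block_size, generator=gen))
            step = perm @ q  # orthogonal, so the accumulated product stays orthogonal
            x_blk = x_blk @ step
            r_blk = r_blk @ step
        rotation[block, block] = r_blk
    return rotation
```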
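
The calibration setup (128 random WikiText2 sequences of length 2048) can be reproduced roughly as follows. This is the common GPTQ-style sampling recipe, not necessarily the authors' exact code; the tokenizer name and seed are placeholders.

```python
import random
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

def sample_calibration_data(tokenizer_name="meta-llama/Llama-2-7b-hf",
                            n_samples=128, seq_len=2048, seed=0):
    """Draw n_samples random windows of seq_len tokens from the WikiText-2 train split."""
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    train = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    tokens = tokenizer("\n\n".join(train["text"]), return_tensors="pt").input_ids
    rng = random.Random(seed)
    windows = []
    for _ in range(n_samples):
        start = rng.randint(0, tokens.shape[1] - seq_len - 1)
        windows.append(tokens[:, start:start + seq_len])
    return torch.cat(windows, dim=0)  # shape: (n_samples, seq_len)
```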
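
The quantization configuration in the Experiment Setup row (per-token activations, per-channel weights, round-to-nearest with clipping, and the LWC step size Δ = (γ·max − β·min) / (2^b − 1)) corresponds to fake-quantization routines along these lines. This is a minimal sketch under the stated settings (4 bits, clipping ratios 0.9/0.8), not the released DuQuant kernels, and it omits the smoothing, rotation, and permutation transformations that precede quantization.

```python
import torch

def quantize_per_token(x, n_bits=4, clip_ratio=0.9):
    """Asymmetric round-to-nearest fake quantization with one scale/zero-point per token.
    x: (tokens, channels) activations; assumes max(x) > 0 > min(x) per token."""
    xmax = x.amax(dim=-1, keepdim=True) * clip_ratio
    xmin = x.amin(dim=-1, keepdim=True) * clip_ratio
    scale = (xmax - xmin).clamp(min=1e-5) / (2 ** n_bits - 1)
    zero_point = (-xmin / scale).round()
    q = (x / scale + zero_point).round().clamp(0, 2 ** n_bits - 1)
    return (q - zero_point) * scale

def quantize_per_channel(w, n_bits=4, gamma=0.8, beta=0.8):
    """Per-output-channel weight fake quantization. With gamma = beta = 0.8 this mimics
    the fixed 0.8 weight clipping ratio; learnable gamma, beta in [0, 1] gives the
    LWC-style step size delta = (gamma*max(W) - beta*min(W)) / (2^b - 1).
    w: (out_features, in_features)."""
    wmax = w.amax(dim=-1, keepdim=True) * gamma
    wmin = w.amin(dim=-1, keepdim=True) * beta
    scale = (wmax - wmin).clamp(min=1e-5) / (2 ** n_bits - 1)
    zero_point = (-wmin / scale).round()
    q = (w / scale + zero_point).round().clamp(0, 2 ** n_bits - 1)
    return (q - zero_point) * scale
```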
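
Finally, the WikiText2/C4 perplexity numbers referenced in the Open Datasets row follow the standard non-overlapping 2048-token window protocol. The sketch below shows that common recipe for a Hugging Face causal LM; it is an assumption, not taken from the paper's evaluation code.

```python
import torch

@torch.no_grad()
def eval_perplexity(model, input_ids, seq_len=2048):
    """Perplexity over non-overlapping seq_len-token windows.
    input_ids: (1, total_tokens) tokenized test split of WikiText-2 or C4."""
    model.eval()
    n_windows = input_ids.shape[1] // seq_len
    nll_sum = 0.0
    for i in range(n_windows):
        window = input_ids[:, i * seq_len:(i + 1) * seq_len].to(model.device)
        loss = model(window, labels=window).loss  # mean token-level cross-entropy
        nll_sum = nll_sum + loss.float() * seq_len
    return torch.exp(nll_sum / (n_windows * seq_len))
```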