QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models
Authors: Jing Liu, Ruihao Gong, Xiuying Wei, Zhiwei Dong, Jianfei Cai, Bohan Zhuang
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on LLaMA-1 and LLaMA-2 show that QLLM is able to obtain accurate quantized models efficiently. |
| Researcher Affiliation | Collaboration | Jing Liu (1,2), Ruihao Gong (2,3), Xiuying Wei (2,4), Zhiwei Dong (2,5), Jianfei Cai (1), Bohan Zhuang (1). Affiliations: 1 ZIP Lab, Monash University; 2 SenseTime Research; 3 Beihang University; 4 School of Computer and Communication Sciences, EPFL; 5 University of Science and Technology Beijing |
| Pseudocode | Yes | Algorithm 1: Algorithm of Adaptive Channel Reassembly for one layer in LLM. |
| Open Source Code | Yes | Code is available at ZIP Lab and ModelTC. |
| Open Datasets | Yes | We apply QLLM to quantize the LLaMA-1 (Touvron et al., 2023a) and LLaMA-2 (Touvron et al., 2023b) families. ... Additionally, we evaluate the perplexity... on WikiText2 (Merity et al., 2017), PTB (Marcus et al., 1993) and C4 (Raffel et al., 2020). |
| Dataset Splits | Yes | Following OmniQuant (Shao et al., 2023), we construct the calibration set with 128 randomly sampled sequences from WikiText2, each with a sequence length of 2048. (A minimal construction sketch follows this table.) |
| Hardware Specification | Yes | For example, QLLM quantizes the 4-bit LLaMA-2-70B within 10 hours on a single A100-80G GPU... All training experiments are conducted on a single NVIDIA A100 80G GPU... We measure the inference speed of QLLM on NVIDIA RTX 3090 GPUs... allows EEC to quantize LLaMA-1-65B on a single 24GB consumer-grade GPU, such as the NVIDIA RTX 4090... |
| Software Dependencies | No | The paper mentions software components like Triton, AdamW, the QUIK codebase, and AutoGPTQ, and cites related papers (e.g., Tillet et al., 2019; Ashkboos et al., 2023). However, it does not provide explicit version numbers for these dependencies, which would be needed for full reproducibility. |
| Experiment Setup | Yes | The rank r of the introduced low-rank parameters is set to 4, and these parameters are trained for 10 epochs with a mini-batch size of 1. We carry out the reconstruction using 4 Attention-FFN blocks. AdamW (Loshchilov & Hutter, 2019) with a linear learning rate decay scheduler is used following (Yao et al., 2022). The learning rate is set to 5 × 10⁻⁴ in most experiments; for LLaMA-2-70B, it is set to 1 × 10⁻⁴. (An optimizer sketch follows this table.) |
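
The calibration-set construction quoted in the Dataset Splits row is concrete enough to sketch. The snippet below is a minimal, hypothetical reconstruction (not the authors' released code), assuming the HuggingFace `datasets` and `transformers` libraries and an illustrative `huggyllama/llama-7b` checkpoint: it tokenizes the WikiText2 training split once and cuts out 128 random windows of 2048 tokens.

```python
# Minimal sketch of the calibration-set construction described in the paper:
# 128 random 2048-token sequences from WikiText2. Assumes HuggingFace
# `datasets`/`transformers`; model name and function name are illustrative.
import random
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

def build_calibration_set(model_name="huggyllama/llama-7b",
                          n_samples=128, seq_len=2048, seed=0):
    random.seed(seed)
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
    raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    # Tokenize the whole corpus once, then slice out random windows.
    ids = tokenizer("\n\n".join(raw["text"]), return_tensors="pt").input_ids[0]
    samples = []
    for _ in range(n_samples):
        start = random.randint(0, ids.numel() - seq_len - 1)
        samples.append(ids[start:start + seq_len].unsqueeze(0))
    return torch.cat(samples, dim=0)  # shape: (128, 2048)
```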
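
Similarly, the optimizer settings in the Experiment Setup row (AdamW, linear learning-rate decay, 10 epochs, mini-batch size 1, lr 5 × 10⁻⁴ or 1 × 10⁻⁴ for LLaMA-2-70B) can be written as a short sketch. The `lowrank_params` argument and helper name are assumptions for illustration, not identifiers from the paper or its codebase.

```python
# Hedged sketch of the reported training setup: AdamW with a linear
# learning-rate decay to zero over the total number of steps.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def make_optimizer(lowrank_params, num_steps, lr=5e-4):
    optimizer = AdamW(lowrank_params, lr=lr)
    # Scale the base lr linearly from 1.0 down to 0.0 across num_steps.
    scheduler = LambdaLR(optimizer,
                         lr_lambda=lambda step: max(0.0, 1.0 - step / num_steps))
    return optimizer, scheduler

# Usage (illustrative): 10 epochs over 128 calibration samples, batch size 1
# optimizer, scheduler = make_optimizer(params, num_steps=10 * 128, lr=5e-4)
```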