Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
LeanQuant: Accurate and Scalable Large Language Model Quantization with Loss-error-aware Grid
Authors: Tianyi Zhang, Anshumali Shrivastava
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments with recent LLMs demonstrate that LeanQuant is highly accurate, comparing favorably against competitive baselines in model quality, and scalable, achieving very accurate quantization of Llama-3.1 405B, one of the largest open-source LLMs to date, using two Quadro RTX 8000-48GB GPUs in 21 hours. Our code is available at https://github.com/LeanModels/LeanQuant. We conduct extensive experiments to validate LeanQuant's effectiveness and scalability in LLM quantization against competitive baselines. |
| Researcher Affiliation | Collaboration | Tianyi Zhang, Dept. of Computer Science, Rice University; xMAD.ai, Houston, TX. Anshumali Shrivastava, Dept. of Computer Science, Rice University; xMAD.ai; ThirdAI Corp.; Ken Kennedy Institute, Houston, TX. |
| Pseudocode | Yes | Algorithm 1: LeanQuant for LLM quantization. Algorithm 2: LeanQuant-Exact for Million-parameter Networks. |
| Open Source Code | Yes | Our code is available at https://github.com/LeanModels/LeanQuant. |
| Open Datasets | Yes | We use a small calibration set of 128 sequences of 2048 tokens from the C4 dataset (Raffel et al., 2020) for computing the Hessian H, and set p = 4. We evaluate quantized LLMs using the perplexity metric on the datasets WikiText2 (Merity et al., 2016) and C4 (Raffel et al., 2020), and zero-shot accuracy on the benchmarks ARC (Clark et al., 2018), LAMBADA (Paperno et al., 2016), MMLU (Hendrycks et al., 2020), HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), and WinoGrande (Sakaguchi et al., 2021). We also quantize and evaluate the instruction-following Llama-3-8B-Instruct using OpenAI GPT-4o (2024-05-13) as a judge on MT-Bench (Zheng et al., 2023). |
| Dataset Splits | Yes | For LeanQuant models, we use a small calibration set of 128 sequences of 2048 tokens from the C4 dataset (Raffel et al., 2020) for computing the Hessian H, and set p = 4. We follow the perplexity evaluation procedure described by Frantar et al. (2022): sequences from the test sets of the WikiText2 and C4 datasets (Merity et al., 2016; Raffel et al., 2020) are concatenated into 128 sequences of length 2048 tokens for perplexity testing. |
| Hardware Specification | Yes | LeanQuant models are quantized using a machine equipped with an L40s-48GB GPU, an AMD EPYC 7R13 48-Core CPU, and 370GB of RAM. To fit Llama-3.1-405B-Instruct in RAM, which is around 800GB in size, we use a machine equipped with 2 Quadro RTX 8000 GPUs, an AMD EPYC 7742 64-Core CPU, and 1.48TB of RAM. The inference efficiency in Table 12 is evaluated on an NVIDIA A100-40GB GPU. |
| Software Dependencies | No | The paper mentions specific tools and models such as llama.cpp (Gerganov, 2023), PyTorch (Paszke et al., 2019), lm-evaluation-harness (Gao et al., 2023), exllamav2 (Turboderp-org, 2024), and OpenAI GPT-4o (2024-05-13). However, it only provides a specific version (date) for GPT-4o, and not for other key software components such as PyTorch, llama.cpp, or exllamav2. The years given in references refer to the cited papers, not to specific software versions. |
| Experiment Setup | Yes | For LeanQuant models, we use a small calibration set of 128 sequences of 2048 tokens from the C4 dataset (Raffel et al., 2020) for computing the Hessian H, and set p = 4. For the baselines, we use the quantized models provided by their official repositories where possible, and quantize the unavailable models using their official codebases and recommended hyperparameters. For OmniQuant, we set the training epochs to 20, enable Learnable Weight Clipping (LWC), and set an LWC learning rate of 1e-2. For SqueezeLLM, there are no tunable parameters. For GPTQ, we turn on activation ordering (quantizing columns in order of decreasing activation size) for more accurate models. In our experiments, we set p = 4 for all models. The parameter T determines the granularity of the search; in our experiments, we set T = 2048. To control the extent of range reduction, we introduce the parameter t, which determines the degree of shrinkage. Lower bit widths require more aggressive shrinking due to the limited number of grid points. We set t for b-bit quantization as follows: 0.2T if b = 4, 0.3T if b = 3, 0.4T if b = 2. |
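The bit-width-dependent shrinkage rule quoted in the Experiment Setup row can be sketched as a small helper. This is a minimal illustration of the stated rule (t = 0.2T, 0.3T, or 0.4T for 4-, 3-, and 2-bit quantization with T = 2048); the function name and the rounding to an integer are assumptions, not taken from the LeanQuant codebase.

```python
def shrinkage_parameter(bits: int, T: int = 2048) -> int:
    """Shrinkage parameter t for b-bit quantization, per the quoted rule.

    Lower bit widths get a larger fraction of T (more aggressive
    shrinking), since fewer grid points are available.
    Rounding to the nearest integer is an assumption.
    """
    fractions = {4: 0.2, 3: 0.3, 2: 0.4}  # fraction of T per bit width
    if bits not in fractions:
        raise ValueError(f"unsupported bit width: {bits}")
    return round(fractions[bits] * T)

# With the paper's T = 2048, the rule yields:
for b in (4, 3, 2):
    print(f"b={b}: t={shrinkage_parameter(b)}")
```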