Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals
Authors: Utkarsh Saxena, Sayeh Sharify, Kaushik Roy, Xin Wang
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on various models and language tasks and show that ResQ outperforms related state-of-the-art approaches. With the Llama and Qwen2.5 families of models, we demonstrate that ResQ outperforms recent uniform and mixed precision PTQ methods on a variety of benchmarks, achieving up to 33% lower perplexity on Wikitext than the next best method SpinQuant, and up to 5× speedup over the 16-bit baseline. |
| Researcher Affiliation | Collaboration | 1Department of Electrical and Computer Engineering, Purdue University, West Lafayette, USA 2d-Matrix, Santa Clara, USA. Correspondence to: Utkarsh Saxena <EMAIL>. |
| Pseudocode | No | The paper describes the quantization scheme and projections using mathematical equations and descriptive text, but it does not include a clearly labeled 'Pseudocode' or 'Algorithm' block, nor structured steps formatted like code. |
| Open Source Code | Yes | Code is available at https://github.com/utkarsh-dmx/project-resq |
| Open Datasets | Yes | We evaluate the quantization approaches on a range of tasks which measure the language modeling ability: perplexity on Wikitext (Merity et al., 2017), common sense reasoning ability: average 0-shot accuracy on Arc-c/e (Clark et al., 2018), BoolQ (Clark et al., 2019), HellaSwag (Zellers et al., 2019), OpenbookQA (Mihaylov et al., 2018), PIQA (Bisk et al., 2020), SIQA (Sap et al., 2019), WinoGrande (Sakaguchi et al., 2021), language understanding: 0-shot accuracy on MMLU (Hendrycks et al., 2021), mathematical understanding: 5-shot GSM8K (Cobbe et al., 2021), dialogue summarization: samsum (Gliwa et al., 2019) and qmsum (Zhong et al., 2021) from LongBench (Bai et al., 2024), code completion: repobench-p (Liu et al., 2024b) from LongBench, and multi-modal understanding: MMMU (Yue et al., 2024). |
| Dataset Splits | Yes | For calibration data, we use 512 randomly chosen samples from Wikitext to obtain the projection matrices, while for GPTQ we use 128 randomly chosen samples from Wikitext, following the original work (Frantar et al., 2023). We use lm evaluation harness version 0.4.5 (Gao et al., 2024) and LongBench (Bai et al., 2024) for all the evaluation tasks. For Arc-c/e, HellaSwag, OpenBookQA, and PIQA we report acc_norm, while for BoolQ, SIQA, and WinoGrande we report acc. |
| Hardware Specification | Yes | The entire process, including obtaining projections and quantization, runs on a single NVIDIA A100 GPU; for Meta-Llama-3-8B, it takes 35 minutes. On an NVIDIA RTX 3090 GPU, we achieve a 1.61× to 3.03× speedup with ResQ over the 16-bit baseline... We evaluate end-to-end batched inference latency on a GPU server with 3 NVIDIA A100 (80 GB) GPUs running Meta-Llama-3-70B. |
| Software Dependencies | Yes | We implement the mixed-precision quantization using CUDA 11.8 and PyTorch. We use lm evaluation harness version 0.4.5 (Gao et al., 2024). |
| Experiment Setup | Yes | We use per-token asymmetric quantization for activations, per-channel symmetric quantization for weights, and per-head asymmetric quantization for the KV cache. We fuse the projection matrices UA, UB, UD into weights and apply GPTQ (Frantar et al., 2023) for weight quantization. The KV cache, as well as the weights and activations of all Linear layers (except mlp.down_proj), are quantized to 4-bit precision, with 1/8 of channels retained in 8-bit precision. The weights and activations within down_proj are uniformly quantized to 4-bit precision. The rotation matrix is trained using Cayley SGD (Li et al., 2020) for 100 training steps at batch size 8 and a learning rate of 1.5. The training data consists of samples of sequence length 2048 taken from Wikitext. |
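The quantization granularities named in the Experiment Setup row can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (which uses fused CUDA/PyTorch kernels and GPTQ); it only shows, under simplified assumptions, what "per-token asymmetric" (one scale and zero-point per activation row) and "per-channel symmetric" (one scale per weight output channel, zero-point fixed at 0) mean. Function names and the fake-quantize (quantize-then-dequantize) formulation are illustrative choices, not from the paper.

```python
import numpy as np

def fake_quant_per_token_asymmetric(x, bits=4):
    """Per-token asymmetric fake quantization of activations.

    Each row of x (one token's activation vector) gets its own
    scale and zero-point derived from its min/max range.
    Returns the dequantized approximation of x.
    """
    qmax = 2 ** bits - 1
    xmin = x.min(axis=-1, keepdims=True)
    xmax = x.max(axis=-1, keepdims=True)
    scale = np.maximum(xmax - xmin, 1e-8) / qmax       # per-token scale
    zero = np.round(-xmin / scale)                     # per-token zero-point
    q = np.clip(np.round(x / scale) + zero, 0, qmax)   # integer codes
    return (q - zero) * scale                          # dequantize

def fake_quant_per_channel_symmetric(w, bits=4):
    """Per-channel symmetric fake quantization of weights.

    Each output channel (row of w) shares one scale; the zero-point
    is implicitly 0, so the integer range is symmetric around zero.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.maximum(np.abs(w).max(axis=-1, keepdims=True), 1e-8) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

# Example: quantize a toy activation batch and weight matrix.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))   # 4 tokens, 16 hidden dims
w = rng.standard_normal((8, 16))   # 8 output channels

x_deq = fake_quant_per_token_asymmetric(x, bits=8)
w_deq = fake_quant_per_channel_symmetric(w, bits=8)
```

In ResQ's mixed-precision scheme, a low-rank subset of channels (1/8) would be kept at 8-bit while the rest use 4-bit; the sketch above applies a single bit-width uniformly just to show the two quantization granularities.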