Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals
Authors: Utkarsh Saxena, Sayeh Sharify, Kaushik Roy, Xin Wang
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on various models and language tasks and show that ResQ outperforms related state-of-the-art approaches. With the Llama and Qwen2.5 families of models, we demonstrate that ResQ outperforms recent uniform and mixed precision PTQ methods on a variety of benchmarks, achieving up to 33% lower perplexity on Wikitext than the next best method SpinQuant, and up to 5× speedup over the 16-bit baseline. |
| Researcher Affiliation | Collaboration | 1Department of Electrical and Computer Engineering, Purdue University, West Lafayette, USA 2d-Matrix, Santa Clara, USA. Correspondence to: Utkarsh Saxena <EMAIL>. |
| Pseudocode | No | The paper describes the quantization scheme and projections using mathematical equations and descriptive text, but it does not include a clearly labeled 'Pseudocode' or 'Algorithm' block, nor structured steps formatted like code. |
| Open Source Code | Yes | Code is available at https://github.com/utkarsh-dmx/project-resq |
| Open Datasets | Yes | We evaluate the quantization approaches on a range of tasks which measure the language modeling ability: perplexity on Wikitext (Merity et al., 2017), common sense reasoning ability: average 0-shot accuracy on Arc-c/e (Clark et al., 2018), BoolQ (Clark et al., 2019), HellaSwag (Zellers et al., 2019), OpenbookQA (Mihaylov et al., 2018), PIQA (Bisk et al., 2020), SIQA (Sap et al., 2019), WinoGrande (Sakaguchi et al., 2021), language understanding: 0-shot accuracy on MMLU (Hendrycks et al., 2021), mathematical understanding: 5-shot GSM8K (Cobbe et al., 2021), dialogue summarization: samsum (Gliwa et al., 2019) and qmsum (Zhong et al., 2021) from LongBench (Bai et al., 2024), code completion: repobench-p (Liu et al., 2024b) from LongBench, and multi-modal understanding: MMMU (Yue et al., 2024). |
| Dataset Splits | Yes | For calibration data, we use 512 randomly chosen samples from Wikitext to obtain the projection matrices, while for GPTQ we use 128 randomly chosen samples from Wikitext, following the original work (Frantar et al., 2023). We use lm evaluation harness version 0.4.5 (Gao et al., 2024) and LongBench (Bai et al., 2024) for all the evaluation tasks. For Arc-c/e, HellaSwag, OpenBookQA, and PIQA we report acc_norm, while for BoolQ, SIQA, and WinoGrande we report acc. |
| Hardware Specification | Yes | The entire process, including obtaining projections and quantization, runs on a single NVIDIA A100 GPU; for Meta-Llama-3-8B, it takes 35 minutes. On an NVIDIA RTX 3090 GPU, we achieve a 1.61× to 3.03× speedup with ResQ over the 16-bit baseline... We evaluate end-to-end batched inference latency on a GPU server with 3 NVIDIA A100 (80 GB) GPUs running Meta-Llama-3-70B. |
| Software Dependencies | Yes | We implement the mixed-precision quantization using CUDA 11.8 and PyTorch. We use lm evaluation harness version 0.4.5 (Gao et al., 2024). |
| Experiment Setup | Yes | We use per-token asymmetric quantization for activations, per-channel symmetric quantization for weights, and per-head asymmetric quantization for the KV cache. We fuse the projection matrices UA, UB, UD into weights and apply GPTQ (Frantar et al., 2023) for weight quantization. The KV cache, as well as the weights and activations of all Linear layers (except mlp.down_proj), are quantized to 4-bit precision, with 1/8 of channels retained in 8-bit precision. The weights and activations within down_proj are uniformly quantized to 4-bit precision. The rotation matrix is trained using Cayley SGD (Li et al., 2020) for 100 training steps at batch size 8 and a learning rate of 1.5. The training data consists of samples of sequence length 2048 taken from Wikitext. |
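The quantization granularities named in the Experiment Setup row can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (which uses fused CUDA/PyTorch kernels and GPTQ); it only shows, under simplified assumptions, what "per-token asymmetric" (one scale and zero-point per activation row) and "per-channel symmetric" (one scale per weight output channel, zero-point fixed at 0) mean. Function names and the fake-quantize (quantize-then-dequantize) formulation are illustrative choices, not from the paper.

```python
import numpy as np

def fake_quant_per_token_asymmetric(x, bits=4):
    """Per-token asymmetric fake quantization of activations.

    Each row of x (one token's activation vector) gets its own
    scale and zero-point derived from its min/max range.
    Returns the dequantized approximation of x.
    """
    qmax = 2 ** bits - 1
    xmin = x.min(axis=-1, keepdims=True)
    xmax = x.max(axis=-1, keepdims=True)
    scale = np.maximum(xmax - xmin, 1e-8) / qmax       # per-token scale
    zero = np.round(-xmin / scale)                     # per-token zero-point
    q = np.clip(np.round(x / scale) + zero, 0, qmax)   # integer codes
    return (q - zero) * scale                          # dequantize

def fake_quant_per_channel_symmetric(w, bits=4):
    """Per-channel symmetric fake quantization of weights.

    Each output channel (row of w) shares one scale; the zero-point
    is implicitly 0, so the integer range is symmetric around zero.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.maximum(np.abs(w).max(axis=-1, keepdims=True), 1e-8) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

# Example: quantize a toy activation batch and weight matrix.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))   # 4 tokens, 16 hidden dims
w = rng.standard_normal((8, 16))   # 8 output channels

x_deq = fake_quant_per_token_asymmetric(x, bits=8)
w_deq = fake_quant_per_channel_symmetric(w, bits=8)
```

In ResQ's mixed-precision scheme, a low-rank subset of channels (1/8) would be kept at 8-bit while the rest use 4-bit; the sketch above applies a single bit-width uniformly just to show the two quantization granularities.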