Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

ParetoQ: Improving Scaling Laws in Extremely Low-bit LLM Quantization

Authors: Zechun Liu, Changsheng Zhao, Hanxian Huang, Sijia Chen, Jing Zhang, Jiawei Zhao, Scott Roy, Lisa Jin, Yunyang Xiong, Yangyang Shi, Lin Xiao, Yuandong Tian, Bilge Soran, Raghuraman Krishnamoorthi, Tijmen Blankevoort, Vikas Chandra

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experimentation shows that ternary, 2-bit, and 3-bit quantization maintains comparable performance in the size-accuracy trade-off and generally exceeds 4-bit and binary quantization. We conduct experiments on eight models including MobileLLM [Liu et al., 2024b] 125M/350M/600M/1B/1.5B and LLaMA-3 [AI@Meta, 2024] 1B/3B/8B.
Researcher Affiliation	Industry	Zechun Liu1 Changsheng Zhao1 Hanxian Huang1 Sijia Chen1 Jing Zhang1 Jiawei Zhao1 Scott Roy1 Lisa Jin1 Yunyang Xiong1 Yangyang Shi1 Lin Xiao1 Yuandong Tian1 Bilge Soran1 Raghuraman Krishnamoorthi1 Tijmen Blankevoort1 Vikas Chandra1. Correspondence to: Zechun Liu EMAIL .
Pseudocode	No	The paper describes the quantization function using mathematical formulas (equations 3, 4, 5) but does not provide structured pseudocode or algorithm blocks.
Open Source Code	Yes	We will open-source our code and models.
Open Datasets	Yes	Our evaluation was carried out on eight zero-shot commonsense reasoning tasks and Wiki2 [Merity et al., 2016] test set. ARC-easy, ARC-challenge [Clark et al., 2018], BoolQ [Clark et al., 2019], PIQA [Bisk et al., 2020], SIQA [Sap et al., 2019], HellaSwag [Zellers et al., 2019], OBQA [Mihaylov et al., 2018], and WinoGrande [Sakaguchi et al., 2021], along with perplexity on the WikiText2 test set [Merity et al., 2016].
Dataset Splits	Yes	Our evaluation was carried out on eight zero-shot commonsense reasoning tasks and Wiki2 [Merity et al., 2016] test set. We conduct experiments on eight models including MobileLLM [Liu et al., 2024b] 125M/350M/600M/1B/1.5B and LLaMA-3 [AI@Meta, 2024] 1B/3B/8B. The use of standard benchmarks like WikiText2 and the eight commonsense reasoning tasks implies the use of their predefined splits, which are typically well-documented in their original papers.
Hardware Specification	Yes	We measure the CPU latency of five MobileLLM models on an Apple M1 MacBook Pro (32GB RAM) using 6 threads. We measured the latency of LLaMA 3.2 models (1B, 3B, 8B) on an H100 NVL GPU (94GB memory).
Software Dependencies	No	The paper mentions software components like 'AdamW optimizer', 'vLLM', 'Machete kernel', and 'CUTLASS mixed precision backbone kernel', but does not provide specific version numbers for these or other key software dependencies.
Experiment Setup	Yes	We employed the AdamW [Loshchilov and Hutter, 2017] optimizer with zero weight decay for optimization. The training was distributed across 16 GPUs, with each GPU handling a batch size of 8. For binary, ternary, and 2-bit quantization settings, the optimization process spanned 120,000 iterations with initial learning rate of $2 \times 10^{-5}$. For 3-bit and 4-bit settings, the process involved 40,000 iterations with initial learning rate of $1 \times 10^{-5}$. The learning rate decayed to zero following cosine learning rate decay.
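
The schedule quoted in the Experiment Setup row can be illustrated with a minimal sketch. It is not the authors' training code. It assumes the initial learning rates read as 2 × 10^-5 and 1 × 10^-5, that there is no warmup phase, and that no gradient accumulation is used (none of these is stated in the quoted text). The names `cosine_lr`, `SCHEDULES`, and `GLOBAL_BATCH` are hypothetical.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float) -> float:
    """Cosine decay from base_lr at step 0 down to zero at total_steps.

    Assumption: no warmup, since the quoted setup does not mention one.
    """
    # Clamp the step so out-of-range values do not leave the [0, 1] interval.
    progress = min(max(step, 0), total_steps) / total_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Values taken from the Experiment Setup row.
NUM_GPUS = 16
PER_GPU_BATCH = 8
# Assumption: no gradient accumulation, so this is the per-step global batch.
GLOBAL_BATCH = NUM_GPUS * PER_GPU_BATCH

# Hypothetical grouping of the two quoted settings.
SCHEDULES = {
    "binary/ternary/2-bit": {"total_steps": 120_000, "base_lr": 2e-5},
    "3-bit/4-bit": {"total_steps": 40_000, "base_lr": 1e-5},
}

if __name__ == "__main__":
    print("global batch size:", GLOBAL_BATCH)
    for name, cfg in SCHEDULES.items():
        # Print the learning rate at the start, midpoint, and end of training.
        for step in (0, cfg["total_steps"] // 2, cfg["total_steps"]):
            print(name, step, cosine_lr(step, **cfg))
```

Under these assumptions, each schedule starts at its initial learning rate, passes through roughly half of it at the midpoint, and reaches zero at the final iteration.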