Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

HBLLM: Wavelet-Enhanced High-Fidelity 1-Bit Quantization for LLMs

Authors: Ningning Chen, Weicai Ye, Ying Jiang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments conducted on the OPT and LLaMA models demonstrate that HBLLM achieves state-of-the-art performance in 1-bit quantization, attaining a perplexity of 6.71 on LLaMA2-13B with an average weight storage of only 1.08 bits. We conduct extensive experiments on OPT [37], LLaMA family [32] of LLMs. Results show that HBLLM achieves state-of-the-art performance under 1-bit quantization.
Researcher Affiliation	Academia	Ningning Chen (1), Weicai Ye (1, 2), Ying Jiang (1, 2). (1) Sun Yat-sen University; (2) Guangdong Province Key Laboratory of Computational Science
Pseudocode	Yes	Algorithm 1 Framework of HBLLM: Details of each function are shown in Algorithm E.1
Open Source Code	Yes	Code available at: https://github.com/Yeyke/HBLLM.
Open Datasets	Yes	We measure language modeling capabilities of these models by evaluating their perplexity on the C4 [26], WikiText2 [22] and PTB [21] datasets. Additionally, we assess zero-shot accuracy on various Common Sense Reasoning Tasks such as PIQA [4], BoolQ [7], OpenBookQA [23], WinoGrande [28], ARC-e, ARC-c [8], HellaSwag [36], which are commonly used for evaluating the performance of LLM quantization methods. To further enhance evaluation coverage, we also include COPA [27] for causal reasoning and LAMBADA [25] for long-context language modeling. All evaluations are conducted using the open-source LLM evaluation framework, LM-Evaluation-Harness [24].
Dataset Splits	Yes	All evaluations are conducted using the open-source LLM evaluation framework, LM-Evaluation-Harness [24]. For the calibration data, we follow the settings adopted in GPTQ and BiLLM, selecting 128 samples from the C4 dataset, with a sequence length of 2048. During quantization, we set the block size to 128 in BiLLM, PB-LLM, ARB-LLM, and HBLLM.
Hardware Specification	Yes	All experiments are conducted with PyTorch on NVIDIA GeForce RTX 3090 GPUs with 24GB of memory. Quantization for models <30B was run on 4 RTX 3090 (24GB), and for models 30B on A800-80GB. See Section 4.1.
Software Dependencies	No	The paper mentions that experiments are conducted with PyTorch, but does not provide specific version numbers for PyTorch or any other software dependencies.
Experiment Setup	Yes	For the calibration data, we follow the settings adopted in GPTQ and BiLLM, selecting 128 samples from the C4 dataset, with a sequence length of 2048. During quantization, we set the block size to 128 in BiLLM, PB-LLM, ARB-LLM, and HBLLM. Activations are kept in full precision (FP16).
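
As a rough illustration of the setup quoted in the table (128 calibration samples of length 2048, block size 128), the following sketch draws fixed-length windows from a token stream and partitions a weight matrix's columns into blocks. It is not taken from the HBLLM repository. The function names, the flat token stream, the random-window sampling, and the reading of "block size" as a column-block width are all assumptions made for illustration.

```python
import random


def sample_calibration_windows(token_ids, n_samples=128, seq_len=2048, seed=0):
    """Draw n_samples random contiguous windows of seq_len tokens.

    Assumption: the calibration corpus is already tokenized into one flat
    list of token ids; the paper's actual sampling code may differ.
    """
    if len(token_ids) < seq_len:
        raise ValueError("token stream is shorter than one window")
    rng = random.Random(seed)
    last_start = len(token_ids) - seq_len
    windows = []
    for _ in range(n_samples):
        start = rng.randint(0, last_start)  # inclusive on both ends
        windows.append(token_ids[start:start + seq_len])
    return windows


def split_into_blocks(n_columns, block_size=128):
    """Return (start, end) column ranges of width block_size.

    Assumption: "block size" refers to column blocks of a weight matrix;
    the last block may be narrower if n_columns is not a multiple.
    """
    return [(s, min(s + block_size, n_columns))
            for s in range(0, n_columns, block_size)]


if __name__ == "__main__":
    fake_tokens = list(range(100_000))  # stand-in for a tokenized corpus
    windows = sample_calibration_windows(fake_tokens)
    print(len(windows), len(windows[0]))
    print(split_into_blocks(300))
```

Fixing the seed makes the drawn windows repeatable across runs, which matters when comparing quantization methods on the same calibration data.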