Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

RSAVQ: Riemannian Sensitivity-Aware Vector Quantization for Large Language Models

Authors: Zukang Xu, Xing Hu, Qiang Wu, Dawei Yang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments demonstrate that RSAVQ outperforms existing methods for LLMs. For example, in 2-bit quantization of LLaMA-3 8B, RSAVQ leads baselines like VPTQ and QuIP# by 0.4 in perplexity (PPL) and 1.5 in zero-shot accuracy. This work offers a practical solution for constrained environments and a theoretical bridge between information geometry and the quantization of neural networks, advancing efficient deep learning. 5 Experiments
Researcher Affiliation	Industry	Zukang Xu (Houmo AI), Xing Hu (Houmo AI), Qiang Wu (Houmo AI), Dawei Yang (Houmo AI). Corresponding author: EMAIL
Pseudocode	Yes	A.13 Algorithm. Algorithm 1: RSAVQ: Riemannian Sensitivity-Aware Vector Quantization
Open Source Code	No	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: We aim to collect the code as soon as possible in a Git repository.
Open Datasets	Yes	The calibration dataset used in our experiments is sampled from the RedPajama dataset[46]. To evaluate the performance of the baselines, we compute the perplexity (PPL) of the models on the WikiText-2 dataset[34]. We evaluate the generalization capability on several zero-shot tasks, including WinoGrande[37], HellaSwag[52], PIQA[4], ARC-e[5], and ARC-c[12]. All evaluations are performed using the open-source LM-Evaluation-Harness[19] toolkit.
Dataset Splits	Yes	To evaluate the performance of the baselines, we compute the perplexity (PPL) of the models on the WikiText-2 dataset[34]. We evaluate the models by randomly sampling sequences from the dataset with the same length as the calibration data. Lower perplexity indicates better preservation of the original output distribution. For direct comparison with methods like VPTQ[31] and QuIP#[42], we use the same sequence lengths during testing. Specifically, we test PPL with sequence lengths 4096 for the LLaMA-2 models and 2048 for the LLaMA-3 models. Additionally, we evaluate generalization capability on several zero-shot tasks, including WinoGrande[37], HellaSwag[52], PIQA[4], ARC-e[5], and ARC-c[12]. All evaluations are performed using the open-source LM-Evaluation-Harness[19] toolkit.
Hardware Specification	Yes	Unless otherwise specified, all experiments are conducted on NVIDIA A100-80GB GPU. In terms of hardware efficiency, we conducted speed and memory usage tests on the LLaMA-2 7B and LLaMA-2 13B models running inference on a single NVIDIA A100 GPU.
Software Dependencies	No	The paper does not explicitly mention specific software dependencies with version numbers. The NeurIPS checklist for this section is answered as [NA], indicating no such details are provided.
Experiment Setup	Yes	For RSAVQ, we use a k-means-based VQ approach similar to VPTQ, with the following settings: the vector length is set to 6, and weight matrices are divided into 4 groups, with each sharing its own codebook. Our experiment showed that as λ increased, the quantization accuracy first improved and then decreased. This indicates that λ has an optimal range where the projection between quantization error and the natural gradient direction is most effective. Based on our experiments, we found that the optimal range for λ lies between 0.01 and 0.1. A.11 Ablation Studies on Group Size and Codebook Vector Length. Effect of group size: We conducted experiments on the LLaMA2-7B model with WikiText2 (sequence length 4096, 2-bit quantization, vector length 6). Results in Table 6 show that performance improves as the number of groups increases, but the gain diminishes after 4 groups. Effect of codebook vector length: Vector length plays a central role in vector quantization. We analyzed its impact using LLaMA2-7B on WikiText2 (sequence length 4096, 2-bit quantization, 2 groups for product quantization, 4 groups for WCSG). Table 7 shows that longer vector lengths yield slight performance gains (e.g., PPL decreases from 5.81 at length 6 to 5.62 at length 14).
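
The Experiment Setup row quotes a k-means-based vector quantization with vector length 6 and 4 groups, each with its own codebook. The following is a minimal sketch of that general setup only, not the RSAVQ algorithm: it uses plain Euclidean k-means with no sensitivity weighting or error projection. The grouping by contiguous column blocks, the codebook size of 16, the toy matrix shape, and every function name are assumptions made for illustration. The helper index_bits_per_weight reflects simple arithmetic (log2 of the codebook size divided by the vector length, ignoring codebook storage); under that arithmetic a 4096-entry codebook at vector length 6 would give 2 index bits per weight, but the quoted text does not state the codebook size the authors used.

```python
import math
import numpy as np


def kmeans(vectors, k, iters=10, seed=0):
    """Plain Lloyd's k-means. Returns (codebook, nearest-codeword indices)."""
    rng = np.random.default_rng(seed)
    # Initialise the codebook from k distinct input vectors.
    codebook = vectors[rng.choice(len(vectors), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(1)
        for j in range(k):
            members = vectors[assign == j]
            if len(members):  # leave empty clusters unchanged
                codebook[j] = members.mean(0)
    # Final assignment against the updated codebook.
    dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return codebook, dists.argmin(1)


def quantize_grouped(weight, vector_length=6, num_groups=4, codebook_size=16):
    """Quantize each column block (assumed grouping) with its own codebook."""
    rows, cols = weight.shape
    assert cols % num_groups == 0
    group_cols = cols // num_groups
    assert (rows * group_cols) % vector_length == 0
    recon = np.empty_like(weight)
    codebooks = []
    for g in range(num_groups):
        cols_slice = slice(g * group_cols, (g + 1) * group_cols)
        block = weight[:, cols_slice]
        vectors = block.reshape(-1, vector_length)
        codebook, idx = kmeans(vectors, codebook_size, seed=g)
        recon[:, cols_slice] = codebook[idx].reshape(block.shape)
        codebooks.append(codebook)
    return recon, codebooks


def index_bits_per_weight(codebook_size, vector_length):
    """Index bits spent per weight, ignoring codebook storage."""
    return math.log2(codebook_size) / vector_length


# Toy example on a random matrix (not a real model weight).
rng = np.random.default_rng(0)
weight = rng.standard_normal((64, 96)).astype(np.float32)
recon, codebooks = quantize_grouped(weight)
mse = float(np.mean((weight - recon) ** 2))
print(f"reconstruction MSE: {mse:.4f}")
print(f"index bits per weight: {index_bits_per_weight(16, 6):.3f}")
```

Sweeping num_groups or vector_length in this sketch mirrors the shape of the ablations quoted above, though the numbers it produces on random data say nothing about the perplexities reported in the paper's Tables 6 and 7.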