Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Learning Grouped Lattice Vector Quantizers for Low-Bit LLM Compression

Authors: Xi Zhang, Xiaolin Wu, Jiamang Wang, Weisi Lin

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on multiple benchmarks show that our approach achieves a better trade-off between model size and accuracy compared to existing post-training quantization baselines, highlighting its effectiveness in deploying large models under stringent resource constraints.
Researcher Affiliation | Collaboration | 1 Nanyang Technological University; 2 Alibaba Group; 3 Southwest Jiaotong University
Pseudocode | Yes | We present the full pseudocode for GLVQ in Algorithm 1. Starting from initial values G_g^(0) and µ_g^(0), each iteration (i) reshapes the weight block, applies the group-specific µ_g-law companding, and produces latent vectors Y_g; (ii) quantizes these vectors via Babai rounding to obtain integer codes Z_g; (iii) reconstructs provisional weights Ŵ_g by inverse companding the lattice outputs; and (iv) minimizes a reconstruction loss augmented with a Frobenius penalty on G_g. Gradients update the generation matrix and curvature parameter, while Z_g is implicitly refreshed by Babai rounding at every iteration. The loop stops when the relative loss reduction falls below ε, returning the final compact representation Ŵ_g that combines group-specific lattice precision with adaptive companding.
Open Source Code | Yes | Our source code is available on GitHub repository: https://github.com/xzhang9308/GLVQ.
Open Datasets | Yes | Our evaluation focuses on perplexity over the Wikitext-2 [37] and C4 [44] datasets, utilizing context lengths of 2048 for Llama 1 and 4096 for Llama 2 models. For zero-shot tasks, we use the LM Eval framework to measure accuracy on tasks such as ARC, PIQA, and the Winograd Schema Challenge (Wino). We adopt 4M tokens from the RedPajama 1T dataset [57] as the calibration sequences in our experiments.
Dataset Splits | Yes | Our evaluation focuses on perplexity over the Wikitext-2 [37] and C4 [44] datasets, utilizing context lengths of 2048 for Llama 1 and 4096 for Llama 2 models. For zero-shot tasks, we use the LM Eval framework to measure accuracy on tasks such as ARC, PIQA, and the Winograd Schema Challenge (Wino).
Hardware Specification | Yes | We implement our method using PyTorch [42] and CUDA [39], with all experiments conducted on NVIDIA A100 GPUs. For timing experiments, we use an NVIDIA RTX 4090 GPU.
Software Dependencies | Yes | We implement our method using PyTorch [42] and CUDA [39], with all experiments conducted on NVIDIA A100 GPUs. For timing experiments, we use an NVIDIA RTX 4090 GPU. [39] NVIDIA. CUDA Toolkit. https://developer.nvidia.com/cuda-toolkit, 2020. Version 10.2.89.
Experiment Setup | Yes | Specifically, we evaluate perplexity on the Wikitext-2 [37] and C4 [44] datasets, utilizing context lengths of 2048 for Llama 1 and 4096 for Llama 2 models. For zero-shot tasks, we use the LM Eval framework to measure accuracy on tasks such as ARC, PIQA, and the Winograd Schema Challenge (Wino). We adopt 4M tokens from the RedPajama 1T dataset [57] as the calibration sequences in our experiments. We implement two variants of our model with lattice dimensions d = 8 and d = 32, referred to as GLVQ-8D and GLVQ-32D, respectively.
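
The Pseudocode entry above describes steps (i) to (iii) of one GLVQ iteration: companding, Babai rounding, and inverse companding. The following is a minimal NumPy sketch of that forward pass for a single weight group, not the authors' implementation. It assumes the textbook µ-law formula, a per-group max-abs scaling to [-1, 1], and a row-basis convention for the generation matrix; all function names are hypothetical, and the gradient updates of the generation matrix and curvature parameter in step (iv) are omitted.

```python
import numpy as np


def mu_law_compress(w, mu):
    """Textbook mu-law companding (assumed form); expects w roughly in [-1, 1]."""
    return np.sign(w) * np.log1p(mu * np.abs(w)) / np.log1p(mu)


def mu_law_expand(y, mu):
    """Exact inverse of mu_law_compress."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu


def babai_round(Y, G):
    """Babai rounding: write each row of Y in the lattice basis G, then round.

    Assumption: rows of G are the basis vectors, so a lattice point is z @ G.
    """
    coords = np.linalg.solve(G.T, Y.T).T  # solves z @ G = y for every row y
    return np.rint(coords).astype(np.int64)


def quantize_group(W, G, mu):
    """One forward pass for a weight group W with lattice dimension d = G.shape[0].

    Returns the integer codes Z and the reconstructed weights W_hat.
    W.size must be divisible by d.
    """
    d = G.shape[0]
    # Hypothetical normalization so the companding input lies in [-1, 1].
    scale = max(float(np.max(np.abs(W))), 1e-12)
    Y = mu_law_compress(W / scale, mu).reshape(-1, d)        # step (i)
    Z = babai_round(Y, G)                                    # step (ii)
    W_hat = mu_law_expand(Z @ G, mu).reshape(W.shape) * scale  # step (iii)
    return Z, W_hat
```

With a diagonal generation matrix such as `0.05 * np.eye(8)`, this reduces to uniform scalar quantization in the companded domain; a learned, non-diagonal matrix is what would make the lattice group-specific in the sense the entry describes.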