Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Q-Palette: Fractional-Bit Quantizers Toward Optimal Bit Allocation for Efficient LLM Deployment

Authors: Deokjae Lee, Hyun Oh Song

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments on LLaMA 2, LLaMA 3, and Qwen models demonstrate that our MSQ framework with Q-Palette consistently outperforms strong data-free and data-aware weight-only PTQ baselines under both memory and latency-constrained settings.
Researcher Affiliation Academia (1) Seoul National University, (2) Neural Processing Research Center
Pseudocode No
    def quantlut_sym(tlut, L, tlut_bits):
        with torch.no_grad():
            lut = torch.arange(1 << L, device=tlut.device)
            lut = (lut + 1) * lut
            sflp = 1 - ((lut >> 15) & 1) * 2
            lut = (lut >> (16 - tlut_bits - 1)) & ((1 << tlut_bits) - 1)
        lut = tlut[lut]
        lut[:, 0] = lut[:, 0] * sflp
        return lut
Open Source Code Yes The code is available at https://github.com/snu-mllab/Q-Palette.
Open Datasets Yes For evaluating language modeling performance, we measure perplexity on the WikiText2 dataset [37], using sequence lengths of 4096 tokens for LLaMA 2 models and 8192 tokens for LLaMA 3 models. Additionally, we report zero-shot accuracy on five downstream tasks: ARC-easy, ARC-challenge, HellaSwag, PiQA, and WinoGrande [37, 8, 56, 3, 44]. Zero-shot evaluations are conducted using the lm_eval library (version 0.4.4). ... we replace the KL-divergence loss computed over randomly generated 128K tokens with the perplexity loss computed over 1M tokens from the RedPajama dataset [52].
Dataset Splits Yes For evaluating language modeling performance, we measure perplexity on the WikiText2 dataset [37], using sequence lengths of 4096 tokens for LLaMA 2 models and 8192 tokens for LLaMA 3 models. Additionally, we report zero-shot accuracy on five downstream tasks: ARC-easy, ARC-challenge, HellaSwag, PiQA, and WinoGrande [37, 8, 56, 3, 44]. Zero-shot evaluations are conducted using the lm_eval library (version 0.4.4).
Hardware Specification Yes GPU: NVIDIA RTX 4090; CPU: AMD EPYC 7B13 64-Core Processor ... GPU: NVIDIA RTX 3090; CPU: AMD EPYC 7402 24-Core Processor
Software Dependencies Yes OS: Ubuntu 22.04.5; CUDA Version: 12.4 ... Zero-shot evaluations are conducted using the lm_eval library (version 0.4.4).
Experiment Setup Yes Figure 1: ...evaluated on the LLaMA 3.1-8B model using an RTX4090 GPU with a batch size of 1. ... we employ FLUTE with a codebook size of 2^3 and a group size of 64, resulting in an average bitwidth of 3.25. ... For evaluating language modeling performance, we measure perplexity on the WikiText2 dataset [37], using sequence lengths of 4096 tokens for LLaMA 2 models and 8192 tokens for LLaMA 3 models.
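
The average bitwidth of 3.25 quoted in the Experiment Setup row is consistent with 3-bit codebook indices plus one 16-bit scale shared by each group of 64 weights. The sketch below checks that arithmetic. The 16-bit scale is an assumption (the excerpt does not state the scale precision), and `average_bitwidth` is a hypothetical helper, not part of FLUTE.

```python
import math


def average_bitwidth(codebook_size: int, group_size: int, scale_bits: int = 16) -> float:
    """Bits per weight for codebook quantization with one scale per group.

    scale_bits=16 is an assumption; the quoted excerpt does not state it.
    """
    code_bits = math.log2(codebook_size)  # bits needed to index the codebook
    overhead = scale_bits / group_size    # per-group scale amortized over the group
    return code_bits + overhead


print(average_bitwidth(codebook_size=2**3, group_size=64))  # 3.25
```

Under the same assumption, halving the group size to 32 would raise the overhead to 0.5 bits per weight.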
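
The integer arithmetic in the `quantlut_sym` excerpt quoted in the Pseudocode row can be traced without PyTorch. The sketch below mirrors only the index and sign computation for each L-bit state and omits the table lookup `tlut[lut]`. The name `lut_indices_and_signs` is hypothetical, the values passed in the usage line are illustrative rather than taken from the paper, and the sketch assumes `tlut_bits` is at most 15 so that the shift amount is non-negative.

```python
def lut_indices_and_signs(L: int, tlut_bits: int):
    """Mirror the integer steps of the quoted quantlut_sym, one state at a time.

    For each state s in [0, 2**L), returns the index into the small table
    (an integer in [0, 2**tlut_bits)) and the sign (+1 or -1) that the
    excerpt applies to column 0 of the looked-up entry.
    """
    mask = (1 << tlut_bits) - 1
    shift = 16 - tlut_bits - 1  # assumes tlut_bits <= 15
    indices, signs = [], []
    for s in range(1 << L):
        h = (s + 1) * s                        # integer mixing of the state
        signs.append(1 - ((h >> 15) & 1) * 2)  # bit 15 set gives -1, otherwise +1
        indices.append((h >> shift) & mask)    # the tlut_bits bits just below bit 15
    return indices, signs


# Illustrative values only; the excerpt does not specify L or tlut_bits.
indices, signs = lut_indices_and_signs(L=8, tlut_bits=9)
```

Each state therefore selects one of `2**tlut_bits` table entries, and bit 15 of the mixed value decides whether the first component is negated.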