Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Q-Palette: Fractional-Bit Quantizers Toward Optimal Bit Allocation for Efficient LLM Deployment
Authors: Deokjae Lee, Hyun Oh Song
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on LLaMA 2, LLaMA 3, and Qwen models demonstrate that our MSQ framework with Q-Palette consistently outperforms strong data-free and data-aware weight-only PTQ baselines under both memory and latency-constrained settings. |
| Researcher Affiliation | Academia | ¹Seoul National University, ²Neural Processing Research Center EMAIL |
| Pseudocode | No | `def quantlut_sym(tlut, L, tlut_bits): with torch.no_grad(): lut = torch.arange(1 << L, device=tlut.device) lut = (lut + 1) * lut sflp = 1 - ((lut >> 15) & 1) * 2 lut = (lut >> (16 - tlut_bits - 1)) & ((1 << tlut_bits) - 1) lut = tlut[lut] lut[:, 0] = lut[:, 0] * sflp return lut` |
| Open Source Code | Yes | The code is available at https://github.com/snu-mllab/Q-Palette. |
| Open Datasets | Yes | For evaluating language modeling performance, we measure perplexity on the WikiText2 dataset [37], using sequence lengths of 4096 tokens for LLaMA 2 models and 8192 tokens for LLaMA 3 models. Additionally, we report zero-shot accuracy on five downstream tasks: ARC-easy, ARC-challenge, HellaSwag, PiQA, and WinoGrande [37, 8, 56, 3, 44]. Zero-shot evaluations are conducted using the lm_eval library (version 0.4.4). ... we replace the KL-divergence loss computed over randomly generated 128K tokens with the perplexity loss computed over 1M tokens from the RedPajama dataset [52]. |
| Dataset Splits | Yes | For evaluating language modeling performance, we measure perplexity on the WikiText2 dataset [37], using sequence lengths of 4096 tokens for LLaMA 2 models and 8192 tokens for LLaMA 3 models. Additionally, we report zero-shot accuracy on five downstream tasks: ARC-easy, ARC-challenge, HellaSwag, PiQA, and WinoGrande [37, 8, 56, 3, 44]. Zero-shot evaluations are conducted using the lm_eval library (version 0.4.4). |
| Hardware Specification | Yes | GPU: NVIDIA RTX 4090; CPU: AMD EPYC 7B13 64-Core Processor ... GPU: NVIDIA RTX 3090; CPU: AMD EPYC 7402 24-Core Processor |
| Software Dependencies | Yes | OS: Ubuntu 22.04.5; CUDA Version: 12.4 ... Zero-shot evaluations are conducted using the lm_eval library (version 0.4.4). |
| Experiment Setup | Yes | Figure 1: ...evaluated on the LLaMA 3.1-8B model using an RTX 4090 GPU with a batch size of 1. ... we employ FLUTE with a codebook size of 2^3 and a group size of 64, resulting in an average bitwidth of 3.25. ... For evaluating language modeling performance, we measure perplexity on the WikiText2 dataset [37], using sequence lengths of 4096 tokens for LLaMA 2 models and 8192 tokens for LLaMA 3 models. |
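
The snippet quoted in the Pseudocode row lost its line breaks during extraction. The following is a minimal pure-Python sketch of the same index arithmetic, not the authors' implementation: it replaces the torch tensors with plain lists so the steps are easy to follow. The helper name `quantlut_sym_indices` is hypothetical, and the sketch assumes `tlut` is a table with `2 ** tlut_bits` rows.

```python
def quantlut_sym_indices(L, tlut_bits):
    """Return one (index, sign) pair per L-bit state, following the quoted arithmetic."""
    pairs = []
    for state in range(1 << L):
        mixed = (state + 1) * state              # integer mixing step: (lut + 1) * lut
        sign = 1 - ((mixed >> 15) & 1) * 2       # bit 15 selects +1 or -1
        index = (mixed >> (16 - tlut_bits - 1)) & ((1 << tlut_bits) - 1)
        pairs.append((index, sign))
    return pairs


def quantlut_sym(tlut, L, tlut_bits):
    """Expand a small table `tlut` into a full table with 2**L rows.

    Assumption: `tlut` is a list of rows (lists of floats) with 2**tlut_bits entries.
    """
    lut = []
    for index, sign in quantlut_sym_indices(L, tlut_bits):
        row = list(tlut[index])                  # copy so the input table is not modified
        row[0] = row[0] * sign                   # only the first component is sign-flipped
        lut.append(row)
    return lut


if __name__ == "__main__":
    # Hypothetical toy table: 2**9 rows of two values each.
    toy_tlut = [[float(i), float(-i)] for i in range(1 << 9)]
    full = quantlut_sym(toy_tlut, L=16, tlut_bits=9)
    print(len(full), full[181])
```

Because the sign bit and the index bits come from different positions of the mixed integer, the expanded table reuses each small-table row with both signs on its first component.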
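
The average bitwidth of 3.25 quoted in the Experiment Setup row can be checked with a short calculation. This sketch rests on two assumptions that the excerpt does not state: the codebook has 2^3 = 8 entries (3 index bits per weight), and each group of 64 weights stores one 16-bit scale. The function name `average_bitwidth` and the `scale_bits` parameter are hypothetical.

```python
import math


def average_bitwidth(codebook_size, group_size, scale_bits=16):
    """Bits per weight: codebook index bits plus the per-group scale overhead.

    Assumption: one scale of `scale_bits` bits is shared by every `group_size` weights.
    """
    index_bits = math.log2(codebook_size)        # 8 entries need 3 bits per weight
    overhead = scale_bits / group_size           # 16 bits spread over 64 weights
    return index_bits + overhead


if __name__ == "__main__":
    print(average_bitwidth(codebook_size=8, group_size=64))
```

Under these assumptions the result is 3 + 16/64 = 3.25, which matches the figure in the table.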