Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Unified Scaling Laws for Compressed Representations

Authors: Andrei Panferov, Alexandra Volkova, Ionut-Vlad Modoranu, Vage Egiazarian, Mher Safaryan, Dan Alistarh

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper investigates the interplay between scaling laws and compression formats, exploring whether a unified scaling framework can accurately predict model performance when training occurs over various compressed representations, such as sparse, scalar-quantized, sparse-quantized or even vector-quantized formats. Our key contributions include validating a general scaling law formulation and showing that it is applicable both individually but also composably across compression types. Based on this, our main finding is demonstrating both theoretically and empirically that there exists a simple capacity metric based on the representation's ability to fit random Gaussian data which can robustly predict parameter efficiency across multiple compressed representations. On the practical side, we extend our formulation to directly compare the accuracy potential of different compressed formats, and to derive better algorithms for training over sparse-quantized formats.
Researcher Affiliation | Collaboration | Andrei Panferov (ISTA); Alexandra Volkova (ISTA); Ionut-Vlad Modoranu (ISTA); Vage Egiazarian (ISTA); Mher Safaryan (ISTA); Dan Alistarh (ISTA & Red Hat AI). Correspondence to: EMAIL.
Pseudocode | Yes | Algorithm 1: VQ Training Forward; Algorithm 2: VQ Training Backward; Algorithm 3: Adam with Straight Through Estimation (STE) and AMSGrad normalization
Open Source Code | Yes | Our source code is available at: IST-DASLab/unifiedsc-laws
Open Datasets | Yes | For our scaling law investigations, we pretrained decoder-only Transformers following the Llama architecture [34] for 30M, 50M, 100M and 200M non-embedding parameters. The models were trained on the C4 dataset [28], using the Llama-2 tokenizer [34].
Dataset Splits | No | The models were trained on the C4 dataset [28], using the Llama-2 tokenizer [34]. To ensure we operate in a data-rich regime, we use 50, 100, and 200 training tokens per model parameter for each training configuration, and train on fixed-length context windows of 512 tokens.
Hardware Specification | Yes | We use 8x80GB H100 machines for efficient training, and training one model takes on average 1 hour.
Software Dependencies | No | The paper mentions software components like "AdamW", "Llama architecture", "Llama-2 tokenizer", "PyTorch", "scipy.stats.norm.ppf", but does not provide specific version numbers for any of these.
Experiment Setup | Yes | For our scaling law investigations, we pretrained decoder-only Transformers following the Llama architecture [34] for 30M, 50M, 100M and 200M non-embedding parameters. The models were trained on the C4 dataset [28], using the Llama-2 tokenizer [34]. To ensure we operate in a data-rich regime, we use 50, 100, and 200 training tokens per model parameter for each training configuration, and train on fixed-length context windows of 512 tokens. We used AdamW [18; 23] with a 0.1 ratio of warm-up epochs with cosine scheduler. Our experimental setup is very similar to that of [9; 19; 10]. More details are provided in Appendix A. Table 2 (Appendix A): Key architectural and training hyperparameters for Llama family models, listing model size, # layers, # heads, # embeddings, and learning rate.
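
The token budgets and learning-rate schedule described in the Experiment Setup entry can be illustrated with a short sketch. This is not the authors' code: the function names are hypothetical, the warm-up is assumed to be linear over the first 10% of optimizer steps, the cosine decay is assumed to end at zero, and the peak learning rate is a placeholder (the actual values are in Table 2 of the paper's Appendix A, which is not reproduced here).

```python
import math

# Values quoted in the report above.
MODEL_SIZES = [30_000_000, 50_000_000, 100_000_000, 200_000_000]  # non-embedding parameters
TOKENS_PER_PARAM = [50, 100, 200]
CONTEXT_LENGTH = 512   # tokens per training sequence
WARMUP_RATIO = 0.1     # fraction of training spent in warm-up


def token_budget(n_params: int, tokens_per_param: int) -> int:
    """Total training tokens for one (model size, data ratio) configuration."""
    return n_params * tokens_per_param


def num_sequences(n_tokens: int, context_length: int = CONTEXT_LENGTH) -> int:
    """Number of full fixed-length sequences that fit in the token budget."""
    return n_tokens // context_length


def lr_at_step(step: int, total_steps: int, peak_lr: float,
               warmup_ratio: float = WARMUP_RATIO, min_lr: float = 0.0) -> float:
    """Linear warm-up followed by cosine decay (assumed shape, see lead-in)."""
    warmup_steps = int(warmup_ratio * total_steps)
    if warmup_steps > 0 and step < warmup_steps:
        # Ramp linearly so the last warm-up step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for n_params in MODEL_SIZES:
        for ratio in TOKENS_PER_PARAM:
            tokens = token_budget(n_params, ratio)
            print(f"{n_params:>11,d} params x {ratio:>3d} tok/param "
                  f"= {tokens:>14,d} tokens, {num_sequences(tokens):,d} sequences")
```

Enumerating the grid this way gives 4 model sizes times 3 data ratios, i.e. 12 dense configurations, before any compression format is varied.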