Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

Superposition Yields Robust Neural Scaling

Authors: Yizhou Liu, Ziming Liu, Jeff Gore

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We confirmed that open-sourced LLMs operate in the strong superposition regime and have loss scaling inversely with model dimension, and that the Chinchilla scaling laws are also consistent with this behavior. Our results identify representation superposition as a central driver of neural scaling laws, providing insights into questions like when neural scaling laws can be improved and when they will break down.¹ ... We evaluate four open-sourced model classes, Opt [39], GPT2 [40], Qwen [41], and Pythia [42], which have model sizes from around 100M to 70B (evaluation details in Appendix C).
Researcher Affiliation Academia Yizhou Liu, Ziming Liu, and Jeff Gore, Massachusetts Institute of Technology, EMAIL
Pseudocode No The paper describes the architecture and loss of the toy model in Figure 2a and through equations, but it does not provide any explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes ¹ Code is available at https://github.com/liuyz0/Superposition_Scaling
Open Datasets Yes We used the following publicly available datasets for evaluation: Wikitext-103: Standard English language modeling dataset. Pile-10k: A subset of The Pile, designed for diverse textual data. C4: Colossal Clean Crawled Corpus, containing large-scale web text. Book Corpus: Large-scale collection of books used for unsupervised learning. ... [43] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016. [44] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020. [45] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020. [46] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27, 2015.
Dataset Splits Yes The final test loss is calculated across newly sampled data with a size being 100 times the batch size. ... Datasets were streamed directly, efficiently sampling 10000 text segments with a maximum sequence length of 2048 tokens (≈ 2 × 10^7 tokens).
Hardware Specification Yes Device: Training performed using one V100 GPU, with floating-point precision (FP32) ... The simulations were performed in parallel using 96 CPU cores, where each core executed one distinct parameter combination defined by the weight decay and data exponent values.
Software Dependencies No The paper mentions using 'AdamW optimizer', 'PyTorch format', and 'Hugging Face’s AutoModelForCausalLM' but does not specify exact version numbers for these software components.
Experiment Setup Yes The hyperparameters are given as follows. Data dimension n: 10240; Model dimension m: Varied exponentially from 2^3 to 2^10; Batch size: 2048; Total training steps: 20000; Learning rate: Initially set to 0.02, scaled according to hidden dimension; Weight decay: 1.0 for strong superposition, and 0.1 for weak superposition ... We employed the AdamW optimizer with distinct learning rates and weight decay settings for the weight matrix W and bias vector b. Specifically, for weight matrix W, learning rate was scaled as lr × (8/m)^0.25 with specified weight decay. And for bias vector b, a learning rate of 2.0/m was used with no weight decay. A cosine decay learning rate schedule with a warm-up phase (5% of total steps) was implemented. ... In the small toy models reported in Figures 3, 4, 5, 14, and 9, we set the hyperparameters as Feature dimension n: Fixed at 1000. Hidden dimension m: Varied logarithmically between 10 and 100, across 6 distinct sizes, i.e., m = 10, 15, 25, 39, 63, 100. Batch size: 2048. Training steps: 20000 steps for each condition. Learning rate: Initialized at 1 × 10^-2, dynamically adjusted using cosine decay scheduling with a warm-up phase of 2000 steps. Weight decay: Explored systematically from -1.0 to 1.0, in increments of 0.22 approximately (10 discrete values).
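
Note: The numbers quoted in the Experiment Setup row can be cross-checked with a little arithmetic. The following is a minimal Python sketch, not the authors' code. All function names are hypothetical. The integer truncation of the hidden dimensions, the linear shape of the warm-up, and the decay to zero are assumptions that the quoted text does not state.

```python
import math


def model_dims():
    # Model dimension m "varied exponentially from 2^3 to 2^10" (quoted setup).
    return [2 ** k for k in range(3, 11)]


def toy_hidden_dims(lo=10, hi=100, count=6):
    # Log-spaced hidden dimensions between lo and hi.
    # ASSUMPTION: values are truncated to integers; this reproduces the
    # quoted list m = 10, 15, 25, 39, 63, 100. The small epsilon guards
    # against floating-point results such as 99.999... at the endpoints.
    step = (math.log10(hi) - math.log10(lo)) / (count - 1)
    return [int(10 ** (math.log10(lo) + i * step) + 1e-9) for i in range(count)]


def weight_decay_grid(lo=-1.0, hi=1.0, count=10):
    # Ten evenly spaced values from -1.0 to 1.0 (step 2/9, about 0.22).
    step = (hi - lo) / (count - 1)
    return [lo + i * step for i in range(count)]


def weight_lr(m, base_lr=0.02):
    # Learning rate for the weight matrix W: lr * (8/m)^0.25.
    return base_lr * (8.0 / m) ** 0.25


def bias_lr(m):
    # Learning rate for the bias vector b: 2.0/m (no weight decay).
    return 2.0 / m


def lr_at(step, peak_lr, total_steps=20000, warmup_frac=0.05):
    # Warm-up followed by cosine decay.
    # ASSUMPTIONS: warm-up is linear from 0, and the cosine decays to 0.
    warmup = round(total_steps * warmup_frac)
    if step < warmup:
        return peak_lr * step / warmup
    progress = (step - warmup) / (total_steps - warmup)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print("model dims:", model_dims())
    print("toy hidden dims:", toy_hidden_dims())
    print("weight decay grid:", [round(w, 2) for w in weight_decay_grid()])
    for m in model_dims():
        print(m, round(weight_lr(m), 5), round(bias_lr(m), 5))
    # Evaluation token budget: 10000 segments of at most 2048 tokens.
    print("max eval tokens:", 10000 * 2048)
```

Under these assumptions, the six toy hidden dimensions and the weight decay step of roughly 0.22 match the quoted values, and 10000 segments of at most 2048 tokens give 20 480 000 tokens, consistent with the quoted ≈ 2 × 10^7.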