Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws
Authors: Prasanna Mayilvahanan, Thaddäus Wiedemer, Sayak Mallick, Matthias Bethge, Wieland Brendel
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments reveal that the pretraining data determines the scaling trend. In contrast, model size, optimization hyperparameters, tokenizer, and even significant architectural differences, such as between transformer-based models like Llama and state-space models like Mamba, generally have limited impact. We make three main observations, illustrated in Fig. 1: 1. LLMs' loss-to-loss scaling consistently follows shifted power laws. 2. Pretraining data is the most salient factor for these scaling laws. 3. In contrast, architecture and tokenizer generally play a minor role, while model size, context length, and optimizer settings have little-to-no impact on loss-to-loss scaling. Our study systematically explores how multiple factors influence scaling laws across a diverse range of architectures and training configurations. |
| Researcher Affiliation | Academia | 1Max Planck Institute for Intelligent Systems 2ELLIS Institute Tübingen 3Tübingen AI Center 4University of Tübingen. |
| Pseudocode | No | The paper describes methods and analyses results but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions using the Lingua framework (Videau et al., 2024) and cites its GitHub repository: 'URL https://github.com/facebookresearch/lingua'. However, it does not provide an explicit statement or link for the source code specific to the methodology described in this paper. |
| Open Datasets | Yes | Our models are trained on FineWeb-Edu (Penedo et al., 2024), C4 (Dodge et al., 2021), and an uncopyrighted version of The Pile dubbed The Pile UC. Some models from Hugging Face are trained on the original version of The Pile (Gao et al., 2020) and The Pile Deduped (Biderman et al., 2023), a deduplicated version. |
| Dataset Splits | No | The paper mentions evaluating models on '5000 sequences sampled from the validation sets' of various datasets, but it does not provide specific train/test/validation split percentages or absolute sample counts for the main pretraining datasets like FineWeb-Edu, C4, or The Pile. |
| Hardware Specification | No | This research utilized compute resources at the Tübingen Machine Learning Cloud, DFG FKZ INST 37/1057-1 FUGG. This statement refers to a computing environment but does not specify hardware details like GPU/CPU models. |
| Software Dependencies | No | The paper mentions using 'SciPy's default curve_fit optimizer' and the 'LM Evaluation Harness framework (Gao et al., 2024)', as well as 'Hugging Face (Wolf et al., 2020)' models and specific tokenizers like 'tiktoken' and 'gpt2'. However, it does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | We train Llama-3 (...) with 417M parameters and Mamba (...) with 420M parameters (...) following Chinchilla scaling laws (...). We collect 200 to 800 checkpoints throughout training, across model sizes, and for various seeds. We investigate varying the context length between 1024, 2048, and 3076 tokens. Optimization settings include Adam and AdamW optimizers, cosine and WSD schedules, learning rates of 3e-4 and 3e-3, and a weight decay of 0.1 or 3.3e-2. We maintain a constant warmup of 5000 steps, a learning rate of 3e-3, and a one-cycle cosine decay schedule. |