Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws
Authors: Prasanna Mayilvahanan, Thaddäus Wiedemer, Sayak Mallick, Matthias Bethge, Wieland Brendel
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments reveal that the pretraining data determines the scaling trend. In contrast, model size, optimization hyperparameters, tokenizer, and even significant architectural differences, such as between transformer-based models like Llama and state-space models like Mamba, generally have limited impact. We make three main observations, illustrated in Fig. 1: 1. LLMs' loss-to-loss scaling consistently follows shifted power laws. 2. Pretraining data is the most salient factor for these scaling laws. 3. In contrast, architecture and tokenizer generally play a minor role, while model size, context length, and optimizer settings have little-to-no impact on loss-to-loss scaling. Our study systematically explores how multiple factors influence scaling laws across a diverse range of architectures and training configurations. |
| Researcher Affiliation | Academia | 1Max Planck Institute for Intelligent Systems 2ELLIS Institute Tübingen 3Tübingen AI Center 4University of Tübingen. |
| Pseudocode | No | The paper describes methods and analyses results but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions using the Lingua framework (Videau et al., 2024) and cites its GitHub repository: 'URL https://github.com/facebookresearch/lingua'. However, it does not provide an explicit statement or link for the source code specific to the methodology described in this paper. |
| Open Datasets | Yes | Our models are trained on FineWeb-Edu (Penedo et al., 2024), C4 (Dodge et al., 2021), and an uncopyrighted version of The Pile dubbed The Pile UC. Some models from Hugging Face are trained on the original version of The Pile (Gao et al., 2020) and The Pile Deduped (Biderman et al., 2023), a deduplicated version. |
| Dataset Splits | No | The paper mentions evaluating models on '5000 sequences sampled from the validation sets' of various datasets, but it does not provide specific train/test/validation split percentages or absolute sample counts for the main pretraining datasets like FineWeb-Edu, C4, or The Pile. |
| Hardware Specification | No | This research utilized compute resources at the Tübingen Machine Learning Cloud, DFG FKZ INST 37/1057-1 FUGG. This statement refers to a computing environment but does not specify hardware details like GPU/CPU models. |
| Software Dependencies | No | The paper mentions using 'SciPy's default curve_fit optimizer' and the 'LM Evaluation Harness framework (Gao et al., 2024)', as well as 'Hugging Face (Wolf et al., 2020)' models and specific tokenizers like 'tiktoken' and 'gpt2'. However, it does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | We train Llama-3 (...) with 417M parameters and Mamba (...) with 420M parameters (...) following Chinchilla scaling laws (...). We collect 200 to 800 checkpoints throughout training, across model sizes, and for various seeds. We investigate varying the context length between 1024, 2048, and 3076 tokens. Optimization settings include Adam and AdamW optimizers, cosine and WSD schedules, learning rates of 3e-4 and 3e-3, and a weight decay of 0.1 or 3.3e-2. We maintain a constant warmup of 5000 steps, a learning rate of 3e-3, and a one-cycle cosine decay schedule. |