Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Datasets, Documents, and Repetitions: The Practicalities of Unequal Data Quality
Authors: Alex Fang, Hadi Pouransari, Matt Jordan, Alexander Toshev, Vaishaal Shankar, Ludwig Schmidt, Tom Gunter
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study model performance at various compute budgets and across multiple pre-training datasets created through data filtering and deduplication. We find that, given appropriate modifications to the training recipe, repeating existing aggressively filtered datasets for up to ten epochs can outperform training on the ten times larger superset for a single epoch across multiple compute budget orders of magnitude. |
| Researcher Affiliation | Collaboration | ¹Apple, ²Stanford, ³work done while at UT Austin, ⁴work done while at Apple |
| Pseudocode | Yes | Listing 1: Pseudocode to sample documents using count manipulation with a simple greedy function. |
| Open Source Code | Yes | code needed to reproduce experiments are publicly available unless specifically omitted for now to maintain anonymity. |
| Open Datasets | Yes | The main datasets we trained on are DCLM (CC-by-4.0), Refined Web (v2, from DCLM) (CC-by-4.0), and C4 (CC-by-4.0 or odc-by depending on source). |
| Dataset Splits | No | The paper discusses training models on datasets like DCLM, Refined Web, and C4 for specific token budgets (e.g., "138B tokens"), and evaluates them on standard benchmarks (e.g., "centered core metric from DCLM", "MMLU"). However, it does not provide explicit train/validation/test splits for the primary training datasets themselves, nor specific percentages or sample counts for such splits to reproduce data partitioning. |
| Hardware Specification | Yes | These models are trained on Nvidia H100s. We use AXLearn for 7B and 12B models in Section 3. These models are trained on TPUs. |
| Software Dependencies | No | The paper mentions using "Open LM (MIT license) for training" and "AXLearn for 7B and 12B models in Section 3", but it does not specify version numbers for these frameworks or any other software libraries or dependencies. |
| Experiment Setup | Yes | Table 1: Weight decay can reduce performance degradation from repeating data. All models are 12.6B parameters trained for 252B total tokens. We report results using the centered core metric. The default weight decay is 0.0316. |
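
The token-budget comparison quoted in the table can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: the helper names are hypothetical, and applying the ten-epoch setting to the 12.6B-parameter, 252B-token configuration is an assumption made for the example. It shows that repeating a filtered pool for ten epochs and reading a ten times larger superset once consume the same total number of tokens.

```python
def tokens_per_parameter(total_tokens: int, n_params: int) -> float:
    """Ratio of training tokens to model parameters."""
    return total_tokens / n_params


def unique_tokens_for_epochs(total_tokens: int, epochs: int) -> int:
    """Unique tokens needed so that `epochs` passes fill `total_tokens`."""
    if epochs <= 0:
        raise ValueError("epochs must be positive")
    return total_tokens // epochs


# Figures quoted in the Experiment Setup row.
TOTAL_TOKENS = 252_000_000_000   # 252B total training tokens
N_PARAMS = 12_600_000_000        # 12.6B parameters

# Assumption for illustration: the "up to ten epochs" setting
# applied to this particular budget.
EPOCHS = 10

ratio = tokens_per_parameter(TOTAL_TOKENS, N_PARAMS)
filtered_pool = unique_tokens_for_epochs(TOTAL_TOKENS, EPOCHS)
# A superset ten times larger, seen once, uses the same budget.
superset_pool = filtered_pool * EPOCHS

print(f"tokens per parameter: {ratio:.1f}")
print(f"unique tokens, {EPOCHS} epochs: {filtered_pool:,}")
print(f"unique tokens, 1 epoch over superset: {superset_pool:,}")
```

Under these assumptions both arms see the same number of training tokens, so any difference in the reported metric reflects data quality and repetition rather than compute.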