Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Datasets, Documents, and Repetitions: The Practicalities of Unequal Data Quality

Authors: Alex Fang, Hadi Pouransari, Matt Jordan, Alexander Toshev, Vaishaal Shankar, Ludwig Schmidt, Tom Gunter

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We study model performance at various compute budgets and across multiple pre-training datasets created through data filtering and deduplication. We find that, given appropriate modifications to the training recipe, repeating existing aggressively filtered datasets for up to ten epochs can outperform training on the ten times larger superset for a single epoch across multiple compute budget orders of magnitude.
Researcher Affiliation | Collaboration | 1Apple 2Stanford 3work done while at UT Austin 4work done while at Apple
Pseudocode | Yes | Listing 1: Pseudocode to sample documents using count manipulation with a simple greedy function.
Open Source Code | Yes | code needed to reproduce experiments are publicly available unless specifically omitted for now to maintain anonymity.
Open Datasets | Yes | The main datasets we trained on are DCLM (CC-by-4.0), Refined Web (v2, from DCLM) (CC-by-4.0), and C4 (CC-by-4.0 or odc-by depending on source).
Dataset Splits | No | The paper discusses training models on datasets like DCLM, Refined Web, and C4 for specific token budgets (e.g., "138B tokens"), and evaluates them on standard benchmarks (e.g., "centered core metric from DCLM", "MMLU"). However, it does not provide explicit train/validation/test splits for the primary training datasets themselves, nor specific percentages or sample counts for such splits to reproduce data partitioning.
Hardware Specification | Yes | These models are trained on Nvidia H100s. We use AXLearn for 7B and 12B models in Section 3. These models are trained on TPUs.
Software Dependencies | No | The paper mentions using "Open LM (MIT license) for training" and "AXLearn for 7B and 12B models in Section 3", but it does not specify version numbers for these frameworks or any other software libraries or dependencies.
Experiment Setup | Yes | Table 1: Weight decay can reduce performance degradation from repeating data. All models are 12.6B parameters trained for 252B total tokens. We report results using the centered core metric. The default weight decay is 0.0316.
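
The token-budget figures quoted above can be sanity-checked with a little arithmetic. The following is a minimal sketch, not code from the paper: the helper names are hypothetical, and the superset and subset sizes are illustrative assumptions chosen only to mirror the "ten times larger superset" comparison in the Research Type row. Only the 12.6B parameter count and the 252B token budget come from the Experiment Setup row.

```python
def tokens_per_parameter(total_tokens: float, n_params: float) -> float:
    """Training tokens seen per model parameter."""
    return total_tokens / n_params


def epochs_over_dataset(total_tokens: float, dataset_tokens: float) -> float:
    """Number of passes over a dataset implied by a fixed token budget."""
    return total_tokens / dataset_tokens


# Figures quoted in the Experiment Setup row.
N_PARAMS = 12.6e9      # 12.6B parameters
TOTAL_TOKENS = 252e9   # 252B total training tokens

ratio = tokens_per_parameter(TOTAL_TOKENS, N_PARAMS)

# ASSUMPTION (illustrative, not from the paper): the superset is exactly as
# large as the token budget, and the filtered subset is one tenth of it.
SUPERSET_TOKENS = TOTAL_TOKENS
SUBSET_TOKENS = SUPERSET_TOKENS / 10

superset_epochs = epochs_over_dataset(TOTAL_TOKENS, SUPERSET_TOKENS)
subset_epochs = epochs_over_dataset(TOTAL_TOKENS, SUBSET_TOKENS)

print(f"tokens per parameter: {ratio:.1f}")
print(f"epochs over superset: {superset_epochs:.1f}")
print(f"epochs over filtered subset: {subset_epochs:.1f}")
```

Under these assumptions the quoted setup works out to 20 tokens per parameter, and a fixed budget that covers the superset once covers a ten times smaller filtered subset ten times, which is the comparison the Research Type row describes.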