Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

DataRater: Meta-Learned Dataset Curation

Authors: Dan Andrei Calian, Greg Farquhar, Iurii Kemaev, Luisa Zintgraf, Matteo Hessel, Jeremy Shar, Junhyuk Oh, András György, Tom Schaul, Jeff Dean, Hado P van Hasselt, David Silver

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In extensive experiments across a range of model scales and datasets, we find that using our DataRater to filter data is highly effective, resulting in significantly improved compute efficiency.
Researcher Affiliation	Industry	Dan A. Calian, Gregory Farquhar, Iurii Kemaev, Luisa M. Zintgraf, Matteo Hessel, Jeremy Shar, Junhyuk Oh, András György, Tom Schaul, Jeffrey Dean, Hado van Hasselt, David Silver. Google DeepMind. Correspondence to EMAIL.
Pseudocode	Yes	Algorithm 1: Meta-learning a DataRater (ϕ_η).
Open Source Code	No	No code or models are publicly released.
Open Datasets	Yes	Our experiments utilise three datasets selected for their varying degrees of pre-filtering: C4 [Raffel et al., 2020], the most stringently filtered; C4/noclean, a less-filtered version of C4; and the Pile [Gao et al., 2020], representing the least filtered data. ... We measure accuracy on the following downstream tasks: HellaSwag [Zellers et al., 2019], SIQA [Sap et al., 2019], PIQA [Bisk et al., 2020], ARC Easy [Clark et al., 2018] and Commonsense QA [Talmor et al., 2019].
Dataset Splits	Yes	Evaluation. We measure negative log likelihood (NLL) on the validation set of each of the three input datasets, as well as on (English-language) Wikipedia. ... Importantly, for each input dataset, the inner models and their corresponding DataRater model are optimised on disjoint subsets of that dataset’s training data.
Hardware Specification	Yes	We have run our experiments on Google TPUs: we used 4 × 4 × 4 topology v5 TPUs for meta-training, and up to 4 × 8 topology v6e TPUs for evaluation.
Software Dependencies	No	We implemented our infrastructure using the jax framework [Bradbury et al., 2018] and its ecosystem of associated libraries [DeepMind et al., 2020]. The paper mentions software tools like 'jax framework' and its 'ecosystem of associated libraries' but does not specify their version numbers, which is required for reproducibility.
Experiment Setup	Yes	We train 50M, 150M, 400M and 1B models for 5k, 12k, 30k and 48k update steps respectively, with a 128 batch size (except for 1B for which we use a doubled batch size of 256), with a sequence length of 2048. For meta-training, for each inner LLM, we use a batch size of 128 to compute the inner loss, and an outer batch size of 128 to compute the outer loss. We used the Adam optimiser, with a decoupled weight decay of 0.1 [Loshchilov and Hutter, 2019], with 100 steps of linear warmup and global norm clipping [Pascanu et al., 2013] of 0.01.
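
The step counts, batch sizes, and sequence length quoted in the Experiment Setup row imply a rough token budget per model scale. The sketch below works through that arithmetic. It is not taken from the paper: it assumes that "5k", "12k", "30k" and "48k" mean exactly 5000, 12000, 30000 and 48000 update steps, and that every sequence in every batch is a full 2048 tokens, so the results are upper-bound estimates. The names `CONFIGS`, `tokens_seen`, and `token_budgets` are hypothetical and exist only for this illustration.

```python
# Training configurations as quoted in the Experiment Setup row.
# Assumption: "5k" means exactly 5000 update steps, and so on.
SEQUENCE_LENGTH = 2048

CONFIGS = {
    "50M": {"steps": 5_000, "batch_size": 128},
    "150M": {"steps": 12_000, "batch_size": 128},
    "400M": {"steps": 30_000, "batch_size": 128},
    "1B": {"steps": 48_000, "batch_size": 256},  # doubled batch size for 1B
}


def tokens_seen(steps, batch_size, sequence_length=SEQUENCE_LENGTH):
    """Upper bound on tokens processed during training.

    Assumes every sequence in every batch is exactly `sequence_length`
    tokens long (no padding and no short documents).
    """
    return steps * batch_size * sequence_length


def token_budgets(configs=CONFIGS):
    """Map each model scale to its estimated training-token count."""
    return {name: tokens_seen(**cfg) for name, cfg in configs.items()}


if __name__ == "__main__":
    for name, total in token_budgets().items():
        print(f"{name}: {total / 1e9:.2f}B tokens")
```

Under these assumptions the estimates come to roughly 1.31B, 3.15B, 7.86B and 25.17B tokens for the 50M, 150M, 400M and 1B models respectively. These figures count tokens the model trains on; they say nothing about how much raw data is discarded by filtering beforehand.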