Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Data-centric Machine Learning Research (DMLR) - 2024

[Figure: Documentation Rate of Empirical Papers by Reproducibility Variable]

[Figure: Distribution of Empirical Papers by Number of Documented Variables]


Venue	Year	Papers	Reproducibility Score	Documentation Score	% Empirical	% Industry	Website
DMLR	2024	27	0.71	4.4	92.59%	56.0%

Reproducibility Score: based on Gundersen et al. (2025). See Methods for details.
Documentation Score: the average score over the seven reproducibility variables for empirical research papers. See Methods for details.
% Empirical: percentage of papers that are empirical research rather than theoretical research.
% Industry: percentage of empirical research papers with at least one author from industry.
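
The Documentation Score and % Empirical figures in the summary row can be cross-checked against the per-paper table. The sketch below is only an illustration: the page does not say which papers were classified as non-empirical, so it assumes (hypothetically) that the two papers with zero documented variables are the non-empirical ones. The names `counts` and `empirical` are introduced here for the example. The Reproducibility Score is not recomputed, since its formula is not given on this page.

```python
# Documented-variable counts, copied in order from the last column
# of the per-paper table on this page (27 papers).
counts = [5, 4, 4, 0, 4, 1, 4, 5, 4, 5, 5, 4, 5, 5,
          5, 5, 0, 6, 5, 5, 3, 5, 5, 1, 5, 6, 5]

# ASSUMPTION (not stated on the page): papers with zero documented
# variables are the ones classified as non-empirical.
empirical = [c for c in counts if c > 0]

# Share of empirical papers among all papers, in percent.
pct_empirical = 100 * len(empirical) / len(counts)

# Mean number of documented variables per empirical paper.
documentation_score = sum(empirical) / len(empirical)

print(f"{pct_empirical:.2f}% empirical, "
      f"documentation score {documentation_score:.2f}")
# prints "92.59% empirical, documentation score 4.44"
```

Under that assumption the result agrees with the reported 92.59% and 4.4, which suggests, but does not prove, that the assumption matches the pipeline's classification.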


Paper	Pseudocode	Open Source Code	Open Datasets	Dataset Splits	Hardware Specification	Software Dependencies	Experiment Setup	Documented Variables
ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications	❌	✅	✅	✅	✅	❌	✅	5
Benchmarking Edge Regression on Temporal Networks	❌	❌	✅	✅	✅	❌	✅	4
Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift	❌	✅	✅	✅	❌	❌	✅	4
Building Better Datasets: Seven Recommendations for Responsible Design from Dataset Creators	❌	❌	❌	❌	❌	❌	❌	0
ComPile: A Large IR Dataset from Production Sources	❌	✅	✅	❌	❌	✅	✅	4
DMLR: Data-centric Machine Learning Research - Past, Present and Future	❌	❌	✅	❌	❌	❌	❌	1
Datasets and Benchmarks for Offline Safe Reinforcement Learning	❌	✅	✅	❌	✅	❌	✅	4
Deep Neural Network Benchmarks for Selective Classification	❌	✅	✅	✅	✅	❌	✅	5
Detecting Errors in a Numerical Response via any Regression Model	✅	✅	✅	✅	❌	❌	❌	4
Evaluating Durability: Benchmark Insights into Image and Text Watermarking	❌	✅	✅	✅	✅	❌	✅	5
FedAIoT: A Federated Learning Benchmark for Artificial Intelligence of Things	❌	✅	✅	✅	✅	❌	✅	5
Forecasting Electric Vehicle Charging Station Occupancy: Smarter Mobility Data Challenge	❌	✅	✅	✅	❌	❌	✅	4
GlycoNMR: Dataset and Benchmark of Carbohydrate-Specific NMR Chemical Shift for Machine Learning Research	❌	✅	✅	✅	✅	❌	✅	5
Highlighting Challenges of State-of-the-Art Semantic Segmentation with HAIR - A Dataset of Historical Aerial Images	❌	✅	✅	✅	✅	❌	✅	5
LabelBench: A Comprehensive Framework for Benchmarking Adaptive Label-Efficient Learning	❌	✅	✅	✅	✅	❌	✅	5
NAFlora-1M: Continental-Scale High-Resolution Fine-Grained Plant Classification Dataset	❌	✅	✅	✅	✅	❌	✅	5
On Catastrophic Inheritance of Large Foundation Models	❌	❌	❌	❌	❌	❌	❌	0
On Minimizing the Training Set Fill Distance in Machine Learning Regression	✅	✅	✅	✅	✅	❌	✅	6
OpenOOD v1.5: Enhanced Benchmark for Out-of-Distribution Detection	❌	✅	✅	✅	✅	❌	✅	5
Potion: Towards Poison Unlearning	✅	❌	✅	✅	✅	❌	✅	5
Properties of Alternative Data for Fairer Credit Risk Predictions	❌	❌	❌	✅	❌	✅	✅	3
Rethinking Symbolic Regression Datasets and Benchmarks for Scientific Discovery	❌	✅	✅	✅	✅	❌	✅	5
The Matrix Reloaded: Towards Counterfactual Group Fairness in Machine Learning	✅	❌	✅	✅	✅	❌	✅	5
The Nine Lives of ImageNet: A Sociotechnical Retrospective of a Foundation Dataset and the Limits of Automated Essentialism	❌	❌	✅	❌	❌	❌	❌	1
VALUED - Vision and Logical Understanding Evaluation Dataset	❌	✅	✅	✅	✅	❌	✅	5
When is Off-Policy Evaluation (Reward Modeling) Useful in Contextual Bandits? A Data-Centric Perspective	✅	✅	✅	✅	✅	❌	✅	6
You can't handle the (dirty) truth: Data-centric Insights Improve Pseudo-Labeling	✅	✅	✅	✅	✅	❌	❌	5
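
A per-variable documentation rate and a distribution of documented-variable counts can be derived from rows like those above. The following is a minimal sketch, assuming the rows are available as tab-separated text with ✅/❌ cells followed by the count. `parse_row`, `SAMPLE_ROWS`, and the other names are hypothetical, and only three rows from the table are included, so the resulting rates describe this sample and not the full venue.

```python
from collections import Counter

# The seven reproducibility variables, in table column order.
VARIABLES = [
    "Pseudocode", "Open Source Code", "Open Datasets", "Dataset Splits",
    "Hardware Specification", "Software Dependencies", "Experiment Setup",
]

# Three rows copied from the table above (a sample, not the full venue).
SAMPLE_ROWS = [
    "On Minimizing the Training Set Fill Distance in Machine Learning Regression"
    "\t✅\t✅\t✅\t✅\t✅\t❌\t✅\t6",
    "Properties of Alternative Data for Fairer Credit Risk Predictions"
    "\t❌\t❌\t❌\t✅\t❌\t✅\t✅\t3",
    "Detecting Errors in a Numerical Response via any Regression Model"
    "\t✅\t✅\t✅\t✅\t❌\t❌\t❌\t4",
]


def parse_row(line):
    """Split one tab-separated row into (title, flags, total)."""
    title, *cells = line.rstrip("\n").split("\t")
    flags = [cell == "✅" for cell in cells[:len(VARIABLES)]]
    total = int(cells[len(VARIABLES)])
    # Sanity check: the printed count should equal the number of check marks.
    if sum(flags) != total:
        raise ValueError(f"count mismatch for {title!r}")
    return title, flags, total


rows = [parse_row(line) for line in SAMPLE_ROWS]

# Fraction of sampled papers documenting each variable.
rates = {
    name: sum(flags[i] for _, flags, _ in rows) / len(rows)
    for i, name in enumerate(VARIABLES)
}

# How many sampled papers document 0, 1, ..., 7 variables.
distribution = Counter(total for _, _, total in rows)

for name, rate in rates.items():
    print(f"{name}: {rate:.0%}")
```

The consistency check inside `parse_row` is a deliberate choice: it catches rows where the check marks and the printed count disagree, which is a plausible failure mode when a table is copied from a rendered page.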