Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

Datamodels: Understanding Predictions with Data and Data with Predictions

Authors: Andrew Ilyas, Sung Min Park, Logan Engstrom, Guillaume Leclerc, Aleksander Madry

ICML 2022

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We then demonstrate that datamodels give rise to a variety of applications, such as: accurately predicting the effect of dataset counterfactuals; identifying brittle predictions; finding semantically similar examples; quantifying train-test leakage; and embedding data into a well-behaved and feature-rich representation space.
Researcher Affiliation	Academia	MIT. Correspondence to: Andrew Ilyas <EMAIL>.
Pseudocode	No	The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code	No	The paper refers to using the FFCV library (Leclerc et al., 2022) with a GitHub link for that specific tool (https://github.com/libffcv/ffcv/), but it does not provide a direct statement or link for the open-source code of the datamodeling framework itself.
Open Datasets	Yes	We use the standard CIFAR-10 dataset (Krizhevsky, 2009). FMoW (Christie et al., 2018) is a land use classification dataset based on satellite imagery. WILDS (Koh et al., 2020) uses a subset of FMoW and repurposes it as a benchmark for out-of-distribution (OOD) generalization; we use the same variant (presized to 224x224, single RGB image per example rather than a time sequence).
Dataset Splits	Yes	split the collected dataset of subset-output pairs into a datamodel training set of size $m$, a validation set of size $m_{\text{val}}$, and a test set of size $m_{\text{test}}$; (e) estimate parameters $\theta$ by fitting $g_\theta$ on subset-output pairs, i.e., by minimizing $\sum_{i=1}^{m} \mathcal{L}\big(g_\theta(\mathbf{1}_{S_i}),\, f_{\mathcal{A}}(x; S_i)\big)$ over the collected datamodel training set, and use the validation set to perform model selection.
Hardware Specification	Yes	We train our models on a cluster of machines, each with 8 NVIDIA A100 GPUs and 96 CPU cores.
Software Dependencies	No	The paper mentions using specific software components like "scikit-learn", "GLMNet", "Celer", and "FFCV library", but does not provide specific version numbers for these tools as used in their experiments.
Experiment Setup	Yes	Table A.2: Hyperparameters for used model class. CIFAR-10: initial LR 0.5, batch size 512, 24 epochs, cyclic LR peak epoch 5, momentum 0.9, weight decay 5e-4. FMoW: initial LR 0.4, batch size 512, 15 epochs, cyclic LR peak epoch 6, momentum 0.9, weight decay 1e-3.
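
Illustration of the Dataset Splits row: the estimation procedure quoted there (split subset-output pairs into training, validation, and test sets; fit the datamodel on the training pairs; select the model on the validation set) can be sketched as below. This is a minimal sketch under stated assumptions, not the authors' code. The model outputs are simulated from a hypothetical sparse linear function rather than obtained by training networks. The choice of a sparse linear datamodel with a squared loss, the coordinate-descent solver, and all names (`fit_lasso`, `true_theta`, the regularization grid) are illustrative and do not come from the paper.

```python
import random

def soft_threshold(z, lam):
    """Proximal operator of the l1 penalty."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def fit_lasso(X, y, lam, n_sweeps=200):
    """Coordinate descent on (1/2m) * sum_i (y_i - b - theta.x_i)^2 + lam * ||theta||_1."""
    m, n = len(X), len(X[0])
    cols = [[row[j] for row in X] for j in range(n)]
    sq = [sum(c * c for c in col) / m for col in cols]
    theta = [0.0] * n
    bias = sum(y) / m
    resid = [yi - bias for yi in y]  # current residuals y - b - theta.x
    for _ in range(n_sweeps):
        for j in range(n):
            if sq[j] == 0.0:
                continue
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(c * (r + theta[j] * c) for c, r in zip(cols[j], resid)) / m
            new = soft_threshold(rho, lam) / sq[j]
            delta = new - theta[j]
            if delta != 0.0:
                resid = [r - delta * c for r, c in zip(resid, cols[j])]
                theta[j] = new
        # Exact update of the unpenalized intercept.
        shift = sum(resid) / m
        bias += shift
        resid = [r - shift for r in resid]
    return theta, bias

def mse(X, y, theta, bias):
    preds = [bias + sum(t * x for t, x in zip(theta, row)) for row in X]
    return sum((p - yi) ** 2 for p, yi in zip(preds, y)) / len(y)

# Simulated subset-output pairs (assumption: stands in for f_A(x; S_i)).
rng = random.Random(0)
true_theta = [2.0, 0.0, 0.0, -1.5, 0.0, 0.0, 0.0, 0.5]  # hypothetical ground truth
masks, outputs = [], []
for _ in range(400):
    # Indicator vector 1_{S_i}: each training example is included with probability 0.5.
    mask = [1.0 if rng.random() < 0.5 else 0.0 for _ in true_theta]
    out = 1.0 + sum(t * s for t, s in zip(true_theta, mask)) + rng.gauss(0.0, 0.1)
    masks.append(mask)
    outputs.append(out)

# Datamodel training / validation / test split (sizes m, m_val, m_test).
X_tr, y_tr = masks[:240], outputs[:240]
X_val, y_val = masks[240:320], outputs[240:320]
X_te, y_te = masks[320:], outputs[320:]

# Model selection: keep the regularization strength with the lowest validation error.
best = None
for lam in [0.0, 0.01, 0.05, 0.2, 1.0]:
    cand_theta, cand_bias = fit_lasso(X_tr, y_tr, lam)
    val_err = mse(X_val, y_val, cand_theta, cand_bias)
    if best is None or val_err < best[0]:
        best = (val_err, lam, cand_theta, cand_bias)

_, best_lam, theta, bias = best
test_mse = mse(X_te, y_te, theta, bias)
print("selected lambda:", best_lam)
print("held-out test MSE:", test_mse)
```

The test set is touched only once, after the regularization strength has been chosen on the validation set, which mirrors the role the quoted passage assigns to each split.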