Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Datamodels: Understanding Predictions with Data and Data with Predictions
Authors: Andrew Ilyas, Sung Min Park, Logan Engstrom, Guillaume Leclerc, Aleksander Madry
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We then demonstrate that datamodels give rise to a variety of applications, such as: accurately predicting the effect of dataset counterfactuals; identifying brittle predictions; finding semantically similar examples; quantifying train-test leakage; and embedding data into a well-behaved and feature-rich representation space. |
| Researcher Affiliation | Academia | MIT. Correspondence to: Andrew Ilyas <EMAIL>. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper refers to using the FFCV library (Leclerc et al., 2022) with a GitHub link for that specific tool (https://github.com/libffcv/ffcv/), but it does not provide a direct statement or link for the open-source code of the datamodeling framework itself. |
| Open Datasets | Yes | We use the standard CIFAR-10 dataset (Krizhevsky, 2009). FMoW (Christie et al., 2018) is a land use classification dataset based on satellite imagery. WILDS (Koh et al., 2020) uses a subset of FMoW and repurposes it as a benchmark for out-of-distribution (OOD) generalization; we use the same variant (presized to 224x224, single RGB image per example rather than a time sequence). |
| Dataset Splits | Yes | split the collected dataset of subset-output pairs into a datamodel training set of size $m$, a validation set of size $m_{\text{val}}$, and a test set of size $m_{\text{test}}$; (e) estimate parameters $\theta$ by fitting $g_\theta$ on subset-output pairs, i.e., by minimizing $\sum_{i=1}^{m} \mathcal{L}\left(g_\theta(\mathbf{1}_{S_i}), f_{\mathcal{A}}(x; S_i)\right)$ over the collected datamodel training set, and use the validation set to perform model selection. |
| Hardware Specification | Yes | We train our models on a cluster of machines, each with 8 NVIDIA A100 GPUs and 96 CPU cores. |
| Software Dependencies | No | The paper mentions using specific software components like "scikit-learn", "GLMNet", "Celer", and "FFCV library", but does not provide specific version numbers for these tools as used in their experiments. |
| Experiment Setup | Yes | Table A.2: Hyperparameters for used model class. CIFAR-10: initial LR 0.5, batch size 512, epochs 24, cyclic LR peak epoch 5, momentum 0.9, weight decay 5e-4. FMoW: initial LR 0.4, batch size 512, epochs 15, cyclic LR peak epoch 6, momentum 0.9, weight decay 1e-3. |
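
The Dataset Splits row quotes a procedure that splits subset-output pairs into training, validation, and test sets, fits $g_\theta$ on the training pairs, and uses the validation set for model selection. The following is a minimal sketch of that workflow on synthetic data, not the authors' implementation. It assumes a linear $g_\theta$ and substitutes closed-form ridge regression for the solvers the paper mentions (scikit-learn, GLMNet, Celer); the function name `fit_linear_datamodel`, the `lambdas` grid, and the synthetic data are all hypothetical.

```python
import numpy as np


def fit_linear_datamodel(masks, outputs, m_val, m_test,
                         lambdas=(0.01, 0.1, 1.0, 10.0), seed=0):
    """Fit a linear g_theta on (subset mask, model output) pairs.

    masks:   (n_subsets, n_train_examples) 0/1 indicator vectors 1_{S_i}
    outputs: (n_subsets,) observed outputs f_A(x; S_i) for one target x
    Ridge regression is an assumption of this sketch, not the paper's solver.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(masks))
    test_idx = order[:m_test]
    val_idx = order[m_test:m_test + m_val]
    train_idx = order[m_test + m_val:]  # the remaining m pairs

    def design(idx):
        # Append a constant column so the model has a bias term.
        return np.hstack([masks[idx], np.ones((len(idx), 1))])

    X_tr, y_tr = design(train_idx), outputs[train_idx]
    X_val, y_val = design(val_idx), outputs[val_idx]
    X_te, y_te = design(test_idx), outputs[test_idx]

    best = None
    for lam in lambdas:
        # Closed-form ridge solution; for simplicity the penalty
        # also applies to the bias term.
        gram = X_tr.T @ X_tr + lam * np.eye(X_tr.shape[1])
        theta = np.linalg.solve(gram, X_tr.T @ y_tr)
        val_mse = float(np.mean((X_val @ theta - y_val) ** 2))
        # Model selection: keep the penalty with the lowest validation error.
        if best is None or val_mse < best[0]:
            best = (val_mse, lam, theta)

    val_mse, lam, theta = best
    # The test set is touched only once, after selection.
    test_mse = float(np.mean((X_te @ theta - y_te) ** 2))
    return theta, lam, val_mse, test_mse


# Synthetic stand-in for collected subset-output pairs (not real data).
rng = np.random.default_rng(1)
n_train_examples, n_subsets = 20, 600
masks = (rng.random((n_subsets, n_train_examples)) < 0.5).astype(float)
true_theta = rng.normal(size=n_train_examples)
outputs = masks @ true_theta + 0.01 * rng.normal(size=n_subsets)

theta, lam, val_mse, test_mse = fit_linear_datamodel(
    masks, outputs, m_val=100, m_test=100)
print(f"selected lambda={lam}, val MSE={val_mse:.5f}, test MSE={test_mse:.5f}")
```

Keeping the test split out of the selection loop is what lets the final test error serve as an unbiased check on the chosen model.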