Datamodels: Understanding Predictions with Data and Data with Predictions
Authors: Andrew Ilyas, Sung Min Park, Logan Engstrom, Guillaume Leclerc, Aleksander Madry
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We then demonstrate that datamodels give rise to a variety of applications, such as: accurately predicting the effect of dataset counterfactuals; identifying brittle predictions; finding semantically similar examples; quantifying train-test leakage; and embedding data into a well-behaved and feature-rich representation space. |
| Researcher Affiliation | Academia | ¹MIT. Correspondence to: Andrew Ilyas <ailyas@mit.edu>. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper refers to using the FFCV library (Leclerc et al., 2022) with a GitHub link for that specific tool (https://github.com/libffcv/ffcv/), but it does not provide a direct statement or link for the open-source code of the datamodeling framework itself. |
| Open Datasets | Yes | We use the standard CIFAR-10 dataset (Krizhevsky, 2009). FMoW (Christie et al., 2018) is a land use classification dataset based on satellite imagery. WILDS (Koh et al., 2020) uses a subset of FMoW and repurposes it as a benchmark for out-of-distribution (OOD) generalization; we use the same variant (presized to 224x224, single RGB image per example rather than a time sequence). |
| Dataset Splits | Yes | split the collected dataset of subset-output pairs into a datamodel training set of size $m$, a validation set of size $m_{\text{val}}$, and a test set of size $m_{\text{test}}$; (e) estimate parameters $\theta$ by fitting $g_\theta$ on subset-output pairs, i.e., by minimizing $\sum_{i=1}^{m} \mathcal{L}\left(g_\theta(\mathbf{1}_{S_i}),\, f_{\mathcal{A}}(x; S_i)\right)$ over the collected datamodel training set, and use the validation set to perform model selection. |
| Hardware Specification | Yes | We train our models on a cluster of machines, each with 8 NVIDIA A100 GPUs and 96 CPU cores. |
| Software Dependencies | No | The paper mentions using specific software components like "scikit-learn", "GLMNet", "Celer", and "FFCV library", but does not provide specific version numbers for these tools as used in their experiments. |
| Experiment Setup | Yes | Table A.2: Hyperparameters for used model class. CIFAR-10: initial LR 0.5, batch size 512, 24 epochs, cyclic LR peak epoch 5, momentum 0.9, weight decay 5e-4. FMoW: initial LR 0.4, batch size 512, 15 epochs, cyclic LR peak epoch 6, momentum 0.9, weight decay 1e-3. |
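
The estimation procedure quoted in the Dataset Splits row (split the subset-output pairs into training, validation, and test sets, fit $g_\theta$ on the training pairs, select the model on the validation set) can be illustrated with a small sketch. This is not the authors' code. It assumes a linear datamodel with an ℓ1 penalty, uses synthetic outputs generated from a made-up sparse `true_theta` in place of actually training models on each subset, and swaps in a plain NumPy proximal-gradient (ISTA) solver rather than any of the solvers the paper mentions. All function and variable names are hypothetical.

```python
import numpy as np


def sample_masks(num_subsets, n_train, alpha, rng):
    """Draw indicator vectors 1_{S_i}; each row marks a random alpha-fraction subset."""
    masks = np.zeros((num_subsets, n_train))
    k = int(alpha * n_train)
    for i in range(num_subsets):
        masks[i, rng.choice(n_train, size=k, replace=False)] = 1.0
    return masks


def fit_lasso_ista(X, y, lam, n_iters=500):
    """Minimize (1/2m)||Xw + b - y||^2 + lam*||w||_1 with an unpenalized intercept b."""
    m, d = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean            # centering handles the intercept
    step = m / (np.linalg.norm(Xc, 2) ** 2)    # 1 / Lipschitz constant of the gradient
    w = np.zeros(d)
    for _ in range(n_iters):
        z = w - step * (Xc.T @ (Xc @ w - yc) / m)                  # gradient step
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft-threshold
    return w, y_mean - x_mean @ w


def mse(w, b, X, y):
    return float(np.mean((X @ w + b - y) ** 2))


rng = np.random.default_rng(0)
n_train, alpha = 40, 0.5            # toy sizes, far smaller than in the paper
m, m_val, m_test = 400, 100, 100    # datamodel train / validation / test sizes

# ASSUMPTION: synthetic stand-in for f_A(x; S_i). A real run would train a model
# on each subset S_i and record its output on the target example x.
true_theta = np.zeros(n_train)
true_theta[:5] = [2.0, -1.5, 1.0, -1.0, 0.5]
masks = sample_masks(m + m_val + m_test, n_train, alpha, rng)
outputs = masks @ true_theta + 0.1 * rng.standard_normal(len(masks))

X_tr, y_tr = masks[:m], outputs[:m]
X_val, y_val = masks[m:m + m_val], outputs[m:m + m_val]
X_te, y_te = masks[m + m_val:], outputs[m + m_val:]

# Model selection: fit on the training pairs, choose lambda on the validation pairs.
candidates = [1e-3, 1e-2, 1e-1, 1.0]
fits = {lam: fit_lasso_ista(X_tr, y_tr, lam) for lam in candidates}
best_lam = min(candidates, key=lambda lam: mse(*fits[lam], X_val, y_val))
theta, bias = fits[best_lam]

# Held-out check against a constant predictor (the training-set mean output).
test_mse = mse(theta, bias, X_te, y_te)
baseline_mse = float(np.mean((y_te - y_tr.mean()) ** 2))
print(f"lambda={best_lam}, test MSE={test_mse:.4f}, baseline MSE={baseline_mse:.4f}")
```

The test set plays no part in fitting or in choosing the regularization strength, so the final comparison against the constant baseline is a held-out estimate of how well the fitted datamodel predicts outputs on unseen subsets.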