Data Feedback Loops: Model-driven Amplification of Dataset Biases

Authors: Rohan Taori, Tatsunori Hashimoto

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments in three conditional prediction scenarios (image classification, visual role-labeling, and language generation) demonstrate that models that exhibit a sampling-like behavior are more faithful and thus more stable. Empirically, we demonstrate the utility of our bias amplification bounds in three different natural experiment settings.
Researcher Affiliation | Academia | Stanford University. Correspondence to: Rohan Taori <rtaori@stanford.edu>.
Pseudocode | Yes | Algorithm 1: Data Feedback Procedure (a hedged sketch of the loop follows the table).
Open Source Code | Yes | We release all code and data for this project at https://github.com/rtaori/data_feedback.
Open Datasets | Yes | We use the CIFAR-5m dataset (Nakkiran et al., 2021); we run data feedback on the imSitu dataset (Yatskar et al., 2016); and we use the RealToxicityPrompts dataset (Gehman et al., 2020).
Dataset Splits | Yes | We also combine all data splits (train, dev, and test), and randomly sample 50 images per category (for a total of 25,200 examples) to create a test set for each new experiment run. The optimization criterion was the average score of five metrics calculated over the given dev set. (A sampling sketch follows the table.)
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models or memory) used to run the experiments are provided.
Software Dependencies | No | No specific software dependencies with version numbers are listed.
Experiment Setup | Yes | For most experiments, we train a BaiduNet9 (Li et al., 2019), which has 94% accuracy when trained on CIFAR-10. We optimize the model using stochastic gradient descent with a batch size of 512, a Nesterov momentum factor of 0.9, and a weight decay of 0.256. The number of epochs trained depends on dataset size: below 20k examples we train for 63 epochs, scaled linearly down to 50 epochs at 50k examples, 38 epochs at 100k examples, and 25 epochs at 1m or more examples. We use a triangular learning rate: over the first fifth of training, the learning rate is scaled linearly up from 0 to 0.4, and then, over the rest of training, scaled linearly back down to 0.001. (A PyTorch configuration sketch follows the table.)
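The quoted Algorithm 1 (Data Feedback Procedure) amounts to repeatedly retraining on a data pool that absorbs the model's own annotations. The sketch below is a minimal illustration of that loop, not the authors' pseudocode; `train_model`, `draw_fresh_examples`, and `human_annotate` are hypothetical callables supplied by the caller.

```python
import random

def data_feedback(initial_dataset, num_rounds, batch_size, human_fraction,
                  train_model, draw_fresh_examples, human_annotate, seed=0):
    """Retrain on a pool that increasingly contains model-labeled points,
    so any prediction bias in the model can compound across rounds.

    Sketch only: the three callables are hypothetical placeholders for the
    paper's training, data-arrival, and human-annotation components.
    """
    rng = random.Random(seed)
    dataset = list(initial_dataset)
    for _ in range(num_rounds):
        model = train_model(dataset)                 # fit on the current pool
        for x in draw_fresh_examples(batch_size):    # new unlabeled inputs
            if rng.random() < human_fraction:
                y = human_annotate(x)                # gold human annotation
            else:
                y = model.predict(x)                 # model-provided annotation
            dataset.append((x, y))                   # the data pool grows
    return dataset
```

With `human_fraction = 1.0` the loop reduces to ordinary dataset growth; lowering it increases the share of model-generated labels, which is the regime the paper's bias amplification bounds address.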
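The per-run test-set construction quoted under Dataset Splits (pool all splits, then sample 50 examples per category) can be sketched as follows. This assumes each example is a dict with a "category" field standing in for the imSitu label; it is an illustration, not the released data pipeline.

```python
import random
from collections import defaultdict

def build_test_set(examples, seed, per_category=50):
    """Pool examples from all splits and draw a fixed number per category."""
    by_category = defaultdict(list)
    for example in examples:                      # examples: train + dev + test pooled
        by_category[example["category"]].append(example)
    rng = random.Random(seed)                     # fresh seed per experiment run
    test_set = []
    for category_examples in by_category.values():
        test_set.extend(rng.sample(category_examples, per_category))
    return test_set                               # e.g. 504 categories x 50 = 25,200
```

Calling this once per run with a new seed matches the "test set for each new experiment run" description.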
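The Experiment Setup row quotes a complete optimizer recipe. Below is a minimal PyTorch sketch of those hyperparameters, assuming piecewise-linear interpolation for the epoch schedule and a step-wise `LambdaLR` for the triangular learning rate; it is one plausible implementation, not the authors' released training code.

```python
import numpy as np
import torch

def num_epochs(dataset_size):
    # 63 epochs below 20k examples, scaled linearly down to 50 at 50k,
    # 38 at 100k, and 25 at 1m or more (assumed piecewise-linear in between).
    sizes = [20_000, 50_000, 100_000, 1_000_000]
    epochs = [63, 50, 38, 25]
    return int(round(np.interp(dataset_size, sizes, epochs)))

def make_optimizer_and_scheduler(model, dataset_size, batch_size=512):
    epochs = num_epochs(dataset_size)
    steps_per_epoch = max(1, dataset_size // batch_size)
    total_steps = epochs * steps_per_epoch
    peak_step = max(1, total_steps // 5)      # first fifth of training ramps up

    optimizer = torch.optim.SGD(
        model.parameters(), lr=0.4, momentum=0.9,
        nesterov=True, weight_decay=0.256,
    )

    def triangular(step):
        # Ramp 0 -> 0.4 over the first fifth, then 0.4 -> 0.001 afterwards.
        if step < peak_step:
            lr = 0.4 * step / peak_step
        else:
            frac = (step - peak_step) / max(1, total_steps - peak_step)
            lr = 0.4 + frac * (0.001 - 0.4)
        return lr / 0.4                       # LambdaLR expects a multiplier of the base lr

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, triangular)
    return optimizer, scheduler, epochs       # call scheduler.step() once per optimizer step
```

The BaiduNet9 model itself is not reconstructed here; `model` is any `torch.nn.Module` passed in by the caller.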