BREEDS: Benchmarks for Subpopulation Shift

Authors: Shibani Santurkar, Dimitris Tsipras, Aleksander Madry

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We develop a methodology for assessing the robustness of models to subpopulation shift: specifically, their ability to generalize to novel data subpopulations that were not observed during training. Our approach leverages the class structure underlying existing datasets to control the data subpopulations that comprise the training and test distributions. This enables us to synthesize realistic distribution shifts whose sources can be precisely controlled and characterized, within existing large-scale datasets. Applying this methodology to the ImageNet dataset, we create a suite of subpopulation shift benchmarks of varying granularity. We then validate that the corresponding shifts are tractable by obtaining human baselines. Finally, we utilize these benchmarks to measure the sensitivity of standard model architectures as well as the effectiveness of existing train-time robustness interventions. (The subpopulation-split idea is sketched in the first code example after this table.)
Researcher Affiliation | Academia | Shibani Santurkar (MIT, shibani@mit.edu), Dimitris Tsipras (MIT, tsipras@mit.edu), Aleksander Madry (MIT, madry@mit.edu)
Pseudocode | Yes | We present the pseudocode for this process in Algorithm 1.
Open Source Code | Yes | Code and data available at https://github.com/MadryLab/BREEDS-Benchmarks.
Open Datasets | Yes | We perform our analysis on the ILSVRC2012 dataset (Russakovsky et al., 2015). This dataset contains a thousand classes from the ImageNet dataset (Deng et al., 2009) with an independently collected validation set.
Dataset Splits | Yes | For all the BREEDS superclass classification tasks, the train and validation sets are obtained by aggregating the train and validation sets of the descendant ImageNet classes (i.e., subpopulations). Specifically, for a given subpopulation, the training and test splits from the original ImageNet dataset are used as is. (A sketch of this aggregation follows the table.)
Hardware Specification | No | The paper states 'Due to computational constraints, we trained a restricted set of model architectures with robustness interventions: ResNet-18 and ResNet-50 for adversarial training, and ResNet-18 and ResNet-34 for all others', but does not provide specific details on the hardware used (e.g., CPU or GPU models, memory).
Software Dependencies | No | The paper mentions using 'standard implementations from the PyTorch library', the 'robustness library', and 'PyTorch transforms', but does not specify version numbers for any of these software dependencies.
Experiment Setup | Yes | For training, we use a batch size of 128, weight decay of 10^-4, and learning rates listed in Table 13. Models were trained until convergence. On ENTITY-13 and ENTITY-30, this required a total of 300 epochs, with 10-fold drops in learning rate every 100 epochs, while on LIVING-17 and NON-LIVING-26, models required a total of 450 epochs, with 10-fold learning rate drops every 150 epochs. For adapting models, we retrained the last (fully-connected) layer on the train split of the target domain, starting from the parameters of the source-trained model. We trained that layer using SGD with a batch size of 128 for 40,000 steps and chose the best learning rate out of [0.01, 0.1, 0.25, 0.5, 1.0, 2.0, 3.0, 5.0, 7.0, 8.0, 10.0, 11.0, 12.0], based on test accuracy. (A sketch of this setup follows the table.)
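
The methodology row above hinges on controlling which subpopulations (ImageNet classes) of each superclass appear in training versus testing. The following is a minimal sketch of that idea in Python; the superclass-to-subclass mapping and the 50/50 split fraction are illustrative assumptions, and this is not the paper's Algorithm 1.

import random

def split_subpopulations(superclass_to_subclasses, source_frac=0.5, seed=0):
    """Split the ImageNet subclasses (subpopulations) of every superclass into
    disjoint source and target sets, so that train and test distributions share
    superclasses but contain different subpopulations.

    superclass_to_subclasses is an illustrative mapping, e.g.
    {"bird": [10, 11, 12, 13], "dog": [151, 152, 153, 154]}, from superclass
    name to ILSVRC class indices.
    """
    rng = random.Random(seed)
    source, target = {}, {}
    for superclass, subclasses in superclass_to_subclasses.items():
        subclasses = sorted(subclasses)
        rng.shuffle(subclasses)
        k = max(1, int(len(subclasses) * source_frac))
        source[superclass] = subclasses[:k]   # subpopulations seen during training
        target[superclass] = subclasses[k:]   # held-out subpopulations for the shifted test set
    return source, target

Because every superclass keeps some subclasses on each side, the label set is identical across the two domains while the underlying subpopulations are disjoint, which is the source of the shift.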
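
The dataset-splits row states that superclass train and validation sets are formed by aggregating the original splits of the descendant ImageNet classes. Below is a hedged sketch of that aggregation, reusing the source/target dictionaries produced by the sketch above; the helper names and the (image, label) iterable are hypothetical.

def build_superclass_labels(split_assignment):
    """Map each ImageNet class index to a superclass index for one domain.
    split_assignment is a dict such as the `source` or `target` output above;
    ImageNet classes outside the assignment are simply dropped."""
    class_to_superclass = {}
    for sc_idx, (superclass, subclasses) in enumerate(sorted(split_assignment.items())):
        for imagenet_class in subclasses:
            class_to_superclass[imagenet_class] = sc_idx
    return class_to_superclass

def relabel(samples, class_to_superclass):
    """Aggregate and relabel an existing ImageNet split. `samples` is any
    iterable of (image, imagenet_label) pairs, e.g. the original train or
    validation split, which is used as is per the row above."""
    for image, label in samples:
        if label in class_to_superclass:
            yield image, class_to_superclass[label]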
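
The experiment-setup row quotes the training and adaptation hyperparameters. Below is a minimal PyTorch sketch of that configuration for the ENTITY-13/ENTITY-30 schedule, assuming a ResNet-18 backbone; the base learning rate is a placeholder (the actual values are in the paper's Table 13), and the momentum value and frozen BatchNorm statistics during adaptation are assumptions not stated in the quoted text.

from torch import nn, optim
from torchvision import models

BATCH_SIZE = 128        # quoted in the setup
WEIGHT_DECAY = 1e-4     # quoted in the setup
BASE_LR = 0.1           # placeholder; per-task values are in the paper's Table 13
EPOCHS = 300            # ENTITY-13 / ENTITY-30 (450 for LIVING-17 / NON-LIVING-26)
LR_DROP_EVERY = 100     # 10-fold drops every 100 epochs (every 150 for the latter pair)

def make_source_trainer(num_superclasses):
    """Model, optimizer, and schedule for training on the source domain."""
    model = models.resnet18(num_classes=num_superclasses)
    optimizer = optim.SGD(model.parameters(), lr=BASE_LR,
                          momentum=0.9,  # assumption; not quoted above
                          weight_decay=WEIGHT_DECAY)
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=LR_DROP_EVERY, gamma=0.1)
    return model, optimizer, scheduler, nn.CrossEntropyLoss()

# Learning-rate sweep for last-layer retraining, quoted in the setup; the best
# value is chosen by test accuracy.
LR_SWEEP = [0.01, 0.1, 0.25, 0.5, 1.0, 2.0, 3.0, 5.0, 7.0, 8.0, 10.0, 11.0, 12.0]

def retrain_last_layer(model, target_loader, lr, steps=40_000):
    """Adapt a source-trained model by retraining only the final fully-connected
    layer on the target domain's train split, starting from the source weights."""
    for p in model.parameters():
        p.requires_grad_(False)
    for p in model.fc.parameters():
        p.requires_grad_(True)
    optimizer = optim.SGD(model.fc.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.eval()  # keep BatchNorm statistics frozen (assumption)
    step = 0
    while step < steps:
        for images, labels in target_loader:
            optimizer.zero_grad()
            loss_fn(model(images), labels).backward()
            optimizer.step()
            step += 1
            if step >= steps:
                break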