BREEDS: Benchmarks for Subpopulation Shift

Authors: Shibani Santurkar, Dimitris Tsipras, Aleksander Madry

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We develop a methodology for assessing the robustness of models to subpopulation shift: specifically, their ability to generalize to novel data subpopulations that were not observed during training. Our approach leverages the class structure underlying existing datasets to control the data subpopulations that comprise the training and test distributions. This enables us to synthesize realistic distribution shifts whose sources can be precisely controlled and characterized, within existing large-scale datasets. Applying this methodology to the ImageNet dataset, we create a suite of subpopulation shift benchmarks of varying granularity. We then validate that the corresponding shifts are tractable by obtaining human baselines. Finally, we utilize these benchmarks to measure the sensitivity of standard model architectures as well as the effectiveness of existing train-time robustness interventions. (The subpopulation-split idea is sketched in the first code example after this table.)
Researcher Affiliation | Academia | Shibani Santurkar (MIT, shibani@mit.edu), Dimitris Tsipras (MIT, tsipras@mit.edu), Aleksander Madry (MIT, madry@mit.edu)
Pseudocode | Yes | We present the pseudocode for this process in Algorithm 1.
Open Source Code | Yes | Code and data available at https://github.com/MadryLab/BREEDS-Benchmarks.
Open Datasets | Yes | We perform our analysis on the ILSVRC2012 dataset (Russakovsky et al., 2015). This dataset contains a thousand classes from the ImageNet dataset (Deng et al., 2009) with an independently collected validation set.
Dataset Splits | Yes | For all the BREEDS superclass classification tasks, the train and validation sets are obtained by aggregating the train and validation sets of the descendant ImageNet classes (i.e., subpopulations). Specifically, for a given subpopulation, the training and test splits from the original ImageNet dataset are used as is. (A sketch of this aggregation follows the table.)
Hardware Specification | No | The paper states 'Due to computational constraints, we trained a restricted set of model architectures with robustness interventions: ResNet-18 and ResNet-50 for adversarial training, and ResNet-18 and ResNet-34 for all others', but does not provide specific details on the hardware used (e.g., CPU or GPU models, memory).
Software Dependencies | No | The paper mentions using 'standard implementations from the PyTorch library', the 'robustness library', and 'PyTorch transforms', but does not specify version numbers for any of these software dependencies.
Experiment Setup | Yes | For training, we use a batch size of 128, weight decay of 10^-4, and learning rates listed in Table 13. Models were trained until convergence. On ENTITY-13 and ENTITY-30, this required a total of 300 epochs, with 10-fold drops in learning rate every 100 epochs, while on LIVING-17 and NON-LIVING-26, models required a total of 450 epochs, with 10-fold learning rate drops every 150 epochs. For adapting models, we retrained the last (fully-connected) layer on the train split of the target domain, starting from the parameters of the source-trained model. We trained that layer using SGD with a batch size of 128 for 40,000 steps and chose the best learning rate out of [0.01, 0.1, 0.25, 0.5, 1.0, 2.0, 3.0, 5.0, 7.0, 8.0, 10.0, 11.0, 12.0], based on test accuracy. (A sketch of this setup follows the table.)
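
The methodology row above hinges on controlling which subpopulations (ImageNet classes) of each superclass appear in training versus testing. The following is a minimal sketch of that idea in Python; the superclass-to-subclass mapping and the 50/50 split fraction are illustrative assumptions, and this is not the paper's Algorithm 1.

import random

def split_subpopulations(superclass_to_subclasses, source_frac=0.5, seed=0):
    """Split the ImageNet subclasses (subpopulations) of every superclass into
    disjoint source and target sets, so that train and test distributions share
    superclasses but contain different subpopulations.

    superclass_to_subclasses is an illustrative mapping, e.g.
    {"bird": [10, 11, 12, 13], "dog": [151, 152, 153, 154]}, from superclass
    name to ILSVRC class indices.
    """
    rng = random.Random(seed)
    source, target = {}, {}
    for superclass, subclasses in superclass_to_subclasses.items():
        subclasses = sorted(subclasses)
        rng.shuffle(subclasses)
        k = max(1, int(len(subclasses) * source_frac))
        source[superclass] = subclasses[:k]   # subpopulations seen during training
        target[superclass] = subclasses[k:]   # held-out subpopulations for the shifted test set
    return source, target

Because every superclass keeps some subclasses on each side, the label set is identical across the two domains while the underlying subpopulations are disjoint, which is the source of the shift.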
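
The dataset-splits row states that superclass train and validation sets are formed by aggregating the original splits of the descendant ImageNet classes. Below is a hedged sketch of that aggregation, reusing the source/target dictionaries produced by the sketch above; the helper names and the (image, label) iterable are hypothetical.

def build_superclass_labels(split_assignment):
    """Map each ImageNet class index to a superclass index for one domain.
    split_assignment is a dict such as the `source` or `target` output above;
    ImageNet classes outside the assignment are simply dropped."""
    class_to_superclass = {}
    for sc_idx, (superclass, subclasses) in enumerate(sorted(split_assignment.items())):
        for imagenet_class in subclasses:
            class_to_superclass[imagenet_class] = sc_idx
    return class_to_superclass

def relabel(samples, class_to_superclass):
    """Aggregate and relabel an existing ImageNet split. `samples` is any
    iterable of (image, imagenet_label) pairs, e.g. the original train or
    validation split, which is used as is per the row above."""
    for image, label in samples:
        if label in class_to_superclass:
            yield image, class_to_superclass[label]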
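
The experiment-setup row quotes the training and adaptation hyperparameters. Below is a minimal PyTorch sketch of that configuration for the ENTITY-13/ENTITY-30 schedule, assuming a ResNet-18 backbone; the base learning rate is a placeholder (the actual values are in the paper's Table 13), and the momentum value and frozen BatchNorm statistics during adaptation are assumptions not stated in the quoted text.

from torch import nn, optim
from torchvision import models

BATCH_SIZE = 128        # quoted in the setup
WEIGHT_DECAY = 1e-4     # quoted in the setup
BASE_LR = 0.1           # placeholder; per-task values are in the paper's Table 13
EPOCHS = 300            # ENTITY-13 / ENTITY-30 (450 for LIVING-17 / NON-LIVING-26)
LR_DROP_EVERY = 100     # 10-fold drops every 100 epochs (every 150 for the latter pair)

def make_source_trainer(num_superclasses):
    """Model, optimizer, and schedule for training on the source domain."""
    model = models.resnet18(num_classes=num_superclasses)
    optimizer = optim.SGD(model.parameters(), lr=BASE_LR,
                          momentum=0.9,  # assumption; not quoted above
                          weight_decay=WEIGHT_DECAY)
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=LR_DROP_EVERY, gamma=0.1)
    return model, optimizer, scheduler, nn.CrossEntropyLoss()

# Learning-rate sweep for last-layer retraining, quoted in the setup; the best
# value is chosen by test accuracy.
LR_SWEEP = [0.01, 0.1, 0.25, 0.5, 1.0, 2.0, 3.0, 5.0, 7.0, 8.0, 10.0, 11.0, 12.0]

def retrain_last_layer(model, target_loader, lr, steps=40_000):
    """Adapt a source-trained model by retraining only the final fully-connected
    layer on the target domain's train split, starting from the source weights."""
    for p in model.parameters():
        p.requires_grad_(False)
    for p in model.fc.parameters():
        p.requires_grad_(True)
    optimizer = optim.SGD(model.fc.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.eval()  # keep BatchNorm statistics frozen (assumption)
    step = 0
    while step < steps:
        for images, labels in target_loader:
            optimizer.zero_grad()
            loss_fn(model(images), labels).backward()
            optimizer.step()
            step += 1
            if step >= steps:
                break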