Change is Hard: A Closer Look at Subpopulation Shift
Authors: Yuzhe Yang, Haoran Zhang, Dina Katabi, Marzyeh Ghassemi
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We then establish a comprehensive benchmark of 20 state-of-the-art algorithms evaluated on 12 real-world datasets in vision, language, and healthcare domains. With results obtained from training over 10,000 models, we reveal intriguing observations for future progress in this space. |
| Researcher Affiliation | Academia | Yuzhe Yang * 1 Haoran Zhang * 1 Dina Katabi 1 Marzyeh Ghassemi 1 Machine learning models often perform poorly on subgroups that are underrepresented in the training data. Yet, little is understood on the variation in mechanisms that cause subpopulation shifts, and how algorithms generalize across such diverse shifts at scale. In this work, we provide a fine-grained analysis of subpopulation shift. We first propose a unified framework that dissects and explains common shifts in subgroups. We then establish a comprehensive benchmark of 20 state-of-the-art algorithms evaluated on 12 real-world datasets in vision, language, and healthcare domains. With results obtained from training over 10,000 models, we reveal intriguing observations for future progress in this space. First, existing algorithms only improve subgroup robustness over certain types of shifts but not others. Moreover, while current algorithms rely on group-annotated validation data for model selection, we find that a simple selection criterion based on worst-class accuracy is surprisingly effective even without any group information. Finally, unlike existing works that solely aim to improve worst-group accuracy (WGA), we demonstrate the fundamental tradeoff between WGA and other important metrics, highlighting the need to carefully choose testing metrics. Code and data are available at: https://github.com/YyzHarry/SubpopBench. 1. Introduction Machine learning models frequently exhibit drops in performance under the presence of distribution shifts (Quinonero-Candela et al., 2008). Constructing machine learning models that are robust to these shifts is critical to the safe deployment of such models in the real-world (Amodei et al., 2016). One ubiquitous type of distribution shift is subpopulation shift, which is characterized by changes in the proportion of some subpopulations between training and deployment. *Equal contribution. 1 MIT CSAIL. Correspondence to: Yuzhe Yang <yuzhe@mit.edu>. |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm', nor are there structured code-like procedural descriptions. |
| Open Source Code | Yes | Code and data are available at: https://github.com/YyzHarry/SubpopBench. |
| Open Datasets | Yes | We explore subpopulation shift using 12 real-world datasets from a variety of modalities and tasks. First, for vision datasets, we use Waterbirds (Wah et al., 2011) and CelebA (Liu et al., 2015), which are commonly used in the spurious correlation literature (Liu et al., 2021). |
| Dataset Splits | Yes | We randomly split the dataset into 85% train, 5% validation, and 10% test splits. (A minimal split sketch appears after the table.) |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, or cloud computing instances with specifications) used to run the experiments. |
| Software Dependencies | No | The paper mentions specific optimizers like AdamW and SGD, and pretrained models like ResNet-50 and BERT. However, it does not provide specific version numbers for key software components or libraries (e.g., Python, PyTorch, TensorFlow versions) used in the experimental setup. |
| Experiment Setup | Yes | We train all models for 5,000 steps on Waterbirds and MetaShift, 10,000 steps on MIMICNotes and ImageNetBG, 20,000 steps on CheXpert and CXRMultisite, and 30,000 steps on all other datasets to ensure convergence. For a fair evaluation across different algorithms, following the training protocol in (Gulrajani & Lopez-Paz, 2021), for each algorithm we conduct a random search of 16 trials over a joint distribution of all its hyperparameters. We then use the validation set to select the best hyperparameters for each algorithm, fix them, and rerun the experiments under 3 different random seeds to report the final average results (and standard deviation). (A sketch of this selection protocol appears after the table.) |
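
The 85% / 5% / 10% split quoted in the Dataset Splits row is straightforward to reproduce. The sketch below is illustrative only and is not taken from the SubpopBench repository; the function name, the NumPy-based implementation, and the fixed seed are assumptions.

```python
# Minimal sketch of an 85% / 5% / 10% random train/val/test split
# (illustrative; not code from the SubpopBench repository).
import numpy as np

def random_split(num_examples: int, seed: int = 0):
    """Return index arrays for an 85/5/10 train/val/test split."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(num_examples)
    n_train = int(0.85 * num_examples)
    n_val = int(0.05 * num_examples)
    train_idx = indices[:n_train]
    val_idx = indices[n_train:n_train + n_val]
    test_idx = indices[n_train + n_val:]
    return train_idx, val_idx, test_idx
```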
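
The Experiment Setup row quotes a DomainBed-style protocol (Gulrajani & Lopez-Paz, 2021): 16 random hyperparameter trials per algorithm, model selection on the validation set, then three reruns of the chosen configuration under different seeds to report mean and standard deviation. The sketch below assumes a user-supplied `train_and_evaluate` function; the hyperparameter ranges and the `val_acc`/`test_acc` keys are hypothetical placeholders, not the paper's actual search space.

```python
# Minimal sketch of the 16-trial random search + 3-seed rerun protocol
# described in the Experiment Setup row. All names here are placeholders.
import random
import statistics

NUM_TRIALS = 16
FINAL_SEEDS = [0, 1, 2]

def sample_hparams(rng: random.Random) -> dict:
    # Hypothetical joint distribution over hyperparameters.
    return {
        "lr": 10 ** rng.uniform(-5, -3),
        "weight_decay": 10 ** rng.uniform(-6, -2),
        "batch_size": rng.choice([32, 64, 128]),
    }

def run_protocol(train_and_evaluate):
    """train_and_evaluate(hparams, seed) -> {"val_acc": ..., "test_acc": ...}"""
    rng = random.Random(0)
    # 1) Random search: 16 trials, ranked by validation accuracy.
    trials = [sample_hparams(rng) for _ in range(NUM_TRIALS)]
    best = max(trials, key=lambda hp: train_and_evaluate(hp, seed=0)["val_acc"])
    # 2) Fix the best hyperparameters and rerun under 3 different seeds.
    test_accs = [train_and_evaluate(best, seed=s)["test_acc"] for s in FINAL_SEEDS]
    return best, statistics.mean(test_accs), statistics.stdev(test_accs)
```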