Change is Hard: A Closer Look at Subpopulation Shift
Authors: Yuzhe Yang, Haoran Zhang, Dina Katabi, Marzyeh Ghassemi
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We then establish a comprehensive benchmark of 20 state-of-the-art algorithms evaluated on 12 real-world datasets in vision, language, and healthcare domains. With results obtained from training over 10,000 models, we reveal intriguing observations for future progress in this space. |
| Researcher Affiliation | Academia | Yuzhe Yang * 1 Haoran Zhang * 1 Dina Katabi 1 Marzyeh Ghassemi 1 Machine learning models often perform poorly on subgroups that are underrepresented in the training data. Yet, little is understood on the variation in mechanisms that cause subpopulation shifts, and how algorithms generalize across such diverse shifts at scale. In this work, we provide a fine-grained analysis of subpopulation shift. We first propose a unified framework that dissects and explains common shifts in subgroups. We then establish a comprehensive benchmark of 20 state-of-the-art algorithms evaluated on 12 real-world datasets in vision, language, and healthcare domains. With results obtained from training over 10,000 models, we reveal intriguing observations for future progress in this space. First, existing algorithms only improve subgroup robustness over certain types of shifts but not others. Moreover, while current algorithms rely on group-annotated validation data for model selection, we find that a simple selection criterion based on worst-class accuracy is surprisingly effective even without any group information. Finally, unlike existing works that solely aim to improve worst-group accuracy (WGA), we demonstrate the fundamental tradeoff between WGA and other important metrics, highlighting the need to carefully choose testing metrics. Code and data are available at: https://github.com/YyzHarry/SubpopBench. 1. Introduction Machine learning models frequently exhibit drops in performance under the presence of distribution shifts (Quinonero-Candela et al., 2008). Constructing machine learning models that are robust to these shifts is critical to the safe deployment of such models in the real-world (Amodei et al., 2016). One ubiquitous type of distribution shift is subpopulation shift, which is characterized by changes in the proportion of some subpopulations between training and deployment. *Equal contribution. 1 MIT CSAIL. Correspondence to: Yuzhe Yang <yuzhe@mit.edu>. |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm', nor are there structured code-like procedural descriptions. |
| Open Source Code | Yes | Code and data are available at: https://github.com/YyzHarry/SubpopBench. |
| Open Datasets | Yes | We explore subpopulation shift using 12 real-world datasets from a variety of modalities and tasks. First, for vision datasets, we use Waterbirds (Wah et al., 2011) and CelebA (Liu et al., 2015), which are commonly used in the spurious correlation literature (Liu et al., 2021). |
| Dataset Splits | Yes | We randomly split the dataset into 85% train, 5% validation, and 10% test splits. (A minimal split sketch appears after the table.) |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, or cloud computing instances with specifications) used to run the experiments. |
| Software Dependencies | No | The paper mentions specific optimizers like AdamW and SGD, and pretrained models like ResNet-50 and BERT. However, it does not provide specific version numbers for key software components or libraries (e.g., Python, PyTorch, TensorFlow versions) used in the experimental setup. |
| Experiment Setup | Yes | We train all models for 5,000 steps on Waterbirds and MetaShift, 10,000 steps on MIMICNotes and ImageNetBG, 20,000 steps on CheXpert and CXRMultisite, and 30,000 steps on all other datasets to ensure convergence. For a fair evaluation across different algorithms, following the training protocol in (Gulrajani & Lopez-Paz, 2021), for each algorithm we conduct a random search of 16 trials over a joint distribution of all its hyperparameters. We then use the validation set to select the best hyperparameters for each algorithm, fix them, and rerun the experiments under 3 different random seeds to report the final average results (and standard deviation). (A sketch of this selection protocol appears after the table.) |
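
The 85% / 5% / 10% split quoted in the Dataset Splits row is straightforward to reproduce. The sketch below is illustrative only and is not taken from the SubpopBench repository; the function name, the NumPy-based implementation, and the fixed seed are assumptions.

```python
# Minimal sketch of an 85% / 5% / 10% random train/val/test split
# (illustrative; not code from the SubpopBench repository).
import numpy as np

def random_split(num_examples: int, seed: int = 0):
    """Return index arrays for an 85/5/10 train/val/test split."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(num_examples)
    n_train = int(0.85 * num_examples)
    n_val = int(0.05 * num_examples)
    train_idx = indices[:n_train]
    val_idx = indices[n_train:n_train + n_val]
    test_idx = indices[n_train + n_val:]
    return train_idx, val_idx, test_idx
```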
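
The Experiment Setup row quotes a DomainBed-style protocol (Gulrajani & Lopez-Paz, 2021): 16 random hyperparameter trials per algorithm, model selection on the validation set, then three reruns of the chosen configuration under different seeds to report mean and standard deviation. The sketch below assumes a user-supplied `train_and_evaluate` function; the hyperparameter ranges and the `val_acc`/`test_acc` keys are hypothetical placeholders, not the paper's actual search space.

```python
# Minimal sketch of the 16-trial random search + 3-seed rerun protocol
# described in the Experiment Setup row. All names here are placeholders.
import random
import statistics

NUM_TRIALS = 16
FINAL_SEEDS = [0, 1, 2]

def sample_hparams(rng: random.Random) -> dict:
    # Hypothetical joint distribution over hyperparameters.
    return {
        "lr": 10 ** rng.uniform(-5, -3),
        "weight_decay": 10 ** rng.uniform(-6, -2),
        "batch_size": rng.choice([32, 64, 128]),
    }

def run_protocol(train_and_evaluate):
    """train_and_evaluate(hparams, seed) -> {"val_acc": ..., "test_acc": ...}"""
    rng = random.Random(0)
    # 1) Random search: 16 trials, ranked by validation accuracy.
    trials = [sample_hparams(rng) for _ in range(NUM_TRIALS)]
    best = max(trials, key=lambda hp: train_and_evaluate(hp, seed=0)["val_acc"])
    # 2) Fix the best hyperparameters and rerun under 3 different seeds.
    test_accs = [train_and_evaluate(best, seed=s)["test_acc"] for s in FINAL_SEEDS]
    return best, statistics.mean(test_accs), statistics.stdev(test_accs)
```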