Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Aggregation Hides Out-of-Distribution Generalization Failures from Spurious Correlations

Authors: Olawale Salaudeen, Haoran Zhang, Kumail Alhamoud, Sara Beery, Marzyeh Ghassemi

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Benchmarks for out-of-distribution (OOD) generalization frequently show a strong positive correlation between in-distribution (ID) and OOD accuracy across models, termed accuracy-on-the-line. This pattern is often taken to imply that spurious correlations, i.e., correlations that improve ID but reduce OOD performance, are rare in practice. We find that this positive correlation is often an artifact of aggregating heterogeneous OOD examples. Using a simple gradient-based method, OODSelect, we identify semantically coherent OOD subsets where accuracy on the line does not hold. Across widely used distribution shift benchmarks, the OODSelect uncovers subsets, sometimes up to over half of the standard OOD set, where higher ID accuracy predicts lower OOD accuracy. Our findings indicate that aggregate metrics can obscure important failure modes of OOD robustness. We release code and the identified subsets to facilitate further research. 1 Introduction Benchmarks for out-of-distribution (OOD) generalization have shown a consistent pattern that models performing well on the training distribution also perform well out-of-distribution, a trend known as accuracy-on-the-line (AoTL) (Miller et al., 2021; Taori et al., 2020). This pattern has often been interpreted as evidence that spurious correlations, features that improve in-distribution (ID) accuracy but harm OOD performance, are uncommon in practice. We show that this apparent robustness is misleading. When OOD data are disaggregated, large and semantically coherent subsets emerge where higher ID accuracy predicts lower OOD accuracy, a phenomenon we term accuracy-on-the-inverse-line (AoTIL). These hidden subsets reveal that aggregation can mask major failures of OOD robustness, suggesting that existing benchmarks may underestimate the prevalence and impact of spurious correlations. 4 Experiments Procedure. Table 1 summarizes the datasets we study. 
Given a typical distribution shift benchmark with at least two domains, i.e., $\mathcal{D} = \{D_1, D_2, \ldots\}$, we fix a pair $D_{\mathrm{ID}}, D_{\mathrm{OOD}} \subset \mathcal{D}$, which are disjoint sets (concatenated) of domains. This pair denotes an experimental setting. In this work, we focus on the standard $D_{\mathrm{ID}}, D_{\mathrm{OOD}}$ splits the community uses for each dataset (Gulrajani and Lopez-Paz, 2020; Koh et al., 2021). For each split, we apply our methodology to identify subsets $D^{s}_{\mathrm{OOD}}$ with AoTIL (Appendix A, Algorithm 1). 5 Empirical Results and Discussion Findings. Overall, we find that many benchmarks contain OODSelect subsets of examples that exhibit AoTIL or a weak correlation, though the size of such subsets varies. The same benchmarks exhibit AoTL when all OOD samples are aggregated (Figure 2).
Researcher Affiliation | Academia | Olawale Salaudeen, Haoran Zhang, Kumail Alhamoud, Sara Beery, Marzyeh Ghassemi; Massachusetts Institute of Technology. Correspondence to EMAIL.
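Pseudocode | Yes | Algorithm 1: OODSelect: Selecting OOD subsets without accuracy-on-the-line. Input: $D^{\mathrm{train}}_{\mathrm{ID}}$, $D^{\mathrm{test}}_{\mathrm{ID}}$: in-distribution train/test splits; $D_{\mathrm{OOD}}$: out-of-distribution dataset; $S \in \mathbb{N}$, $S \le |D_{\mathrm{OOD}}|$: number of OOD samples to select. Output: Subset $D^{s}_{\mathrm{OOD}} \subseteq D_{\mathrm{OOD}}$ of size $S$.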
Open Source Code | Yes | We provide the code and selected subsets [1] for our proposed OOD selection method and analysis. [1] https://github.com/olawalesalaudeen/OODSELECT
Open Datasets | Yes | Across widely used distribution shift benchmarks, the OODSelect uncovers subsets... We release code and the identified subsets to facilitate further research. [...] We consider real-world tasks and distributions such as predicting Finding / No Finding from Chest X-rays where ID domains are from CheXpert (v1.0-small) (Irvin et al., 2019), ChestX-ray8 (Wang et al., 2017), PadChest (Bustos et al., 2020), and VinDr-CXR (Nguyen et al., 2022). The OOD domain is MIMIC-CXR-JPG (Johnson et al., 2019). We also study WILDS (Koh et al., 2021) benchmarks that capture real-world shifts. WILDS-Camelyon (Bandi et al., 2018) targets cancer detection from histopathology slides across hospitals. WILDS-CivilComments (Borkan et al., 2019; Koh et al., 2021) classifies online comments as toxic or non-toxic across demographic subgroups, with OOD domains defined by shifts in identity attributes such as gender, religion, and race. We also study DomainBed (Gulrajani and Lopez-Paz, 2020) benchmarks reflecting different forms of distribution shift: style, dataset collection, and environment. PACS (Li et al., 2017) involves object classification across artistic styles (7 classes across Photo, Art Painting, Cartoon, and Sketch), with Sketch as OOD. VLCS (Fang et al., 2013) spans 5 classes across 4 datasets (VOC2007 (Everingham et al., 2010), LabelMe (Russell et al., 2008), Caltech101 (Fei-Fei et al., 2004), and SUN09 (Choi et al., 2010)), capturing collection biases; LabelMe is OOD. Terra Incognita (Beery et al., 2018) focuses on wildlife recognition across 4 geographic locations, with L46 as OOD.
Dataset Splits | Yes | To evaluate generalization, we randomly partition the same set of models into train, validation, and test splits (60/20/20). We optimize our selection objective on the training split and identify the best-performing OODSelect configuration using the held-out validation split. Final results are reported on the test split.
Hardware Specification | Yes | Table 3: Compute time to reproduce experiments (GPU Hours) per experiment unit on NVIDIA RTX A6000 GPUs.
Software Dependencies | Yes | We use the Adam optimizer (Kingma and Ba, 2014) to optimize Equation 5. We use a cosine annealing schedule to adjust the learning rate and λ (Loshchilov and Hutter, 2016). [...] We use Qwen2.5-32B-Instruct (Yang et al., 2024b; Team, 2024). [...] AIMV2-large-patch14-224-lit (Fini et al., 2024). [...] CLIP (Radford et al., 2021). [...] BLIP-2 (Li et al., 2023) [...] Mixtral (Jiang et al., 2024).
Experiment Setup | Yes | We construct a diverse population of models by varying architecture (from VGG to Vision Transformers, listed below), pretraining weights (TorchVision maintainers and contributors, 2016; Deng et al., 2009; He et al., 2019), initialization (from scratch and transfer learning), and hyperparameters. We train up to 4200 models (Figure 7) with various vision architectures, including variants of ResNets (He et al., 2016), DenseNets (Huang et al., 2017), MobileNets (Howard, 2017), ViT (Dosovitskiy et al., 2020), VGG (Simonyan and Zisserman, 2014), and Inception (Szegedy et al., 2015). We do the same for our language experiments, from BERT (Devlin et al., 2019) to GPT-2 (Radford et al., 2019). A full list of models is provided in Appendix A. [...] To evaluate generalization, we randomly partition the same set of models into train, validation, and test splits (60/20/20). We optimize our selection objective on the training split and identify the best-performing OODSelect configuration using the held-out validation split. Final results are reported on the test split. [...] We use the Adam optimizer (Kingma and Ba, 2014) to optimize Equation 5. We use a cosine annealing schedule to adjust the learning rate and λ (Loshchilov and Hutter, 2016).
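
Illustrative sketch (not from the paper): the excerpts above describe two ideas that a small example may make more concrete. One is the 60/20/20 partition of the model population. The other is the claim that an aggregate OOD set can show a positive ID/OOD correlation while a subset shows a negative one. The Python below is a minimal sketch under stated assumptions. The data are synthetic and deterministic, and the "subset" is hand-constructed rather than found by OODSelect. All names (`split_models`, `accuracy`, `pearson`, `N_A`, `N_B`) are hypothetical. It uses plain Pearson correlation on raw accuracies, which may differ from the transformation used in the paper.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def split_models(n_models, seed=0):
    """Randomly partition model indices into 60/20/20 train/val/test."""
    ids = list(range(n_models))
    random.Random(seed).shuffle(ids)
    n_train = (6 * n_models) // 10
    n_val = (2 * n_models) // 10
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


def accuracy(correct_row, indices):
    """Fraction of the indexed OOD examples a single model gets right."""
    return sum(correct_row[i] for i in indices) / len(indices)


# Synthetic population (assumption): 40 models, 400 OOD examples.
# Group A (300 examples): OOD accuracy rises with ID accuracy.
# Group B (100 examples): OOD accuracy falls as ID accuracy rises.
N_MODELS, N_A, N_B = 40, 300, 100
id_acc = [0.50 + 0.01 * m for m in range(N_MODELS)]

correct = []  # correct[m][i] is 1 if model m classifies OOD example i correctly
for a in id_acc:
    hits_a = round(a * N_A)
    hits_b = round((1.2 - a) * N_B)
    row = [1 if i < hits_a else 0 for i in range(N_A)]
    row += [1 if j < hits_b else 0 for j in range(N_B)]
    correct.append(row)

all_ood = list(range(N_A + N_B))
subset = list(range(N_A, N_A + N_B))  # stand-in for a selected subset

# A selection objective would be fit on train_models and tuned on
# val_models; here only the held-out test models are used for reporting.
train_models, val_models, test_models = split_models(N_MODELS)

test_id = [id_acc[m] for m in test_models]
r_all = pearson(test_id, [accuracy(correct[m], all_ood) for m in test_models])
r_sub = pearson(test_id, [accuracy(correct[m], subset) for m in test_models])

print(f"aggregate OOD correlation: {r_all:+.3f}")
print(f"subset OOD correlation:    {r_sub:+.3f}")
```

Because group A is three times larger than group B, the aggregate trend is dominated by group A and stays positive, even though every model with higher ID accuracy does worse on group B. Reporting on held-out models, as in the quoted protocol, guards against a subset that only fits the models it was selected on.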