Assaying Out-Of-Distribution Generalization in Transfer Learning

Authors: Florian Wenzel, Andrea Dittadi, Peter Gehler, Carl-Johann Simon-Gabriel, Max Horn, Dominik Zietlow, David Kernert, Chris Russell, Thomas Brox, Bernt Schiele, Bernhard Schölkopf, Francesco Locatello

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We fine-tune over 31k networks, from nine different architectures, in the many- and few-shot setting. Our findings confirm that in- and out-of-distribution accuracies tend to increase jointly, but show that their relation is largely dataset-dependent, and in general more nuanced and more complex than posited by previous, smaller-scale studies. (...) From 36 existing datasets, we extract 172 in-distribution (ID) and out-of-distribution (OOD) dataset pairs, fine-tuning and evaluating over 31k models to gain a broader insight in the sometimes contradicting statements on OOD robustness in previous research. (An illustrative ID-vs-OOD evaluation sketch is given below the table.)
Researcher Affiliation | Collaboration | Florian Wenzel (1), Andrea Dittadi (2), Peter Gehler (1) (...) Affiliations: 1 = AWS Tübingen, 2 = Technical University of Denmark
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm'.
Open Source Code | Yes | The code for the evaluation study is at github.com/amazon-research/assaying-ood.
Open Datasets | Yes | We evaluate nine state-of-the-art deep learning models with publicly available pre-trained weights for ImageNet1k / ILSVRC2012 [24]. We consider 36 datasets grouped into ten different tasks sharing the same labels. (...) We extract 172 (ID, OOD) dataset pairs from the different domains of the ten tasks: DomainNet [83], PACS [84], SVIRO [85], Terra Incognita [13], as well as the Caltech101 [86], VLCS [87], SUN09 [88], VOC2007 [89], and the WILDS datasets [90] (from which we extract two tasks).
Dataset Splits | Yes | For each task, we take a single training dataset to fine-tune the model and report evaluation metrics on both its ID test set and all the other OOD test sets. (...) For the datasets from the WILDS benchmark, we use the provided ID test and OOD test splits.
Hardware Specification | Yes | See Appendix I.3; 17 GPU-years on NVIDIA T4 GPUs (cloud hosted).
Software Dependencies | No | Weights for the pre-trained models were taken from the PyTorch Image Models repository [100]. (A model-loading sketch is given below the table.)
Experiment Setup | Yes | Models are fine-tuned on a single GPU using Adam [91] with a batch size of 64 and a constant learning rate. (...) For each model we consider the learning rate and the number of fine-tuning epochs. We first ran a large sweep over these two hyperparameters on a subset of the experiments and used it to pre-select a set of four parameter combinations that included the best-performing models for each architecture. Additionally, we study three different augmentation strategies: standard ImageNet augmentation (i.e., no additional augmentation), RandAugment [101], and AugMix [28]. (A hedged fine-tuning sketch is given below the table.)
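
To make the software-dependency row concrete, here is a minimal sketch of pulling ImageNet-pretrained weights from the PyTorch Image Models (timm) repository cited above. The architecture name and the size of the new classification head are illustrative assumptions, not values taken from the paper.

    # Minimal sketch: ImageNet-pretrained weights via timm (PyTorch Image Models).
    # The architecture name and num_classes are illustrative assumptions.
    import timm
    import torch

    # Pre-trained backbone with a fresh classification head for the downstream task.
    model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=10)

    # Sanity check with a dummy batch.
    x = torch.randn(2, 3, 224, 224)
    print(model(x).shape)  # torch.Size([2, 10])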
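
The experiment-setup row maps onto a standard PyTorch fine-tuning loop. The sketch below assumes Adam, a batch size of 64, a constant learning rate, and RandAugment as the extra augmentation (AugMix or no extra augmentation are the other strategies mentioned); the dataset path, architecture, learning rate, and epoch count are placeholders rather than the paper's pre-selected hyperparameter combinations.

    # Hedged fine-tuning sketch: Adam, batch size 64, constant learning rate,
    # optional extra augmentation (RandAugment shown; AugMix or none are alternatives).
    # Dataset path, architecture, learning rate, and epoch count are placeholders.
    import timm
    import torch
    from torch import nn
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    train_tf = transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.RandAugment(),        # or transforms.AugMix(), or omit for the baseline
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    train_set = datasets.ImageFolder("path/to/id_train", transform=train_tf)  # placeholder path
    loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = timm.create_model("resnet50", pretrained=True,
                              num_classes=len(train_set.classes)).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # constant learning rate, no scheduler
    criterion = nn.CrossEntropyLoss()

    for epoch in range(10):  # placeholder epoch count
        model.train()
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()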
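
Finally, the ID-vs-OOD relation referenced in the research-type row can be made concrete with a small evaluation helper. The sketch below computes test accuracy on an ID and an OOD split for a fine-tuned model and then correlates ID against OOD accuracy across a collection of models; the plain Pearson correlation (np.corrcoef) and the accuracy values shown are generic placeholders, not the paper's exact statistics.

    # Sketch: accuracy on ID and OOD test splits, then the ID/OOD accuracy relation
    # across many fine-tuned models. Correlation choice and values are placeholders.
    import numpy as np
    import torch

    @torch.no_grad()
    def accuracy(model, loader, device="cuda"):
        model.eval()
        correct = total = 0
        for images, labels in loader:
            preds = model(images.to(device)).argmax(dim=1).cpu()
            correct += (preds == labels).sum().item()
            total += labels.numel()
        return correct / total

    # For one (ID, OOD) dataset pair and one fine-tuned model:
    #   id_acc  = accuracy(model, id_test_loader)
    #   ood_acc = accuracy(model, ood_test_loader)

    # Across many fine-tuned models on the same (ID, OOD) pair, the relation can be
    # summarised by a correlation coefficient (values below are placeholders):
    id_accs = np.array([0.91, 0.87, 0.93, 0.82])
    ood_accs = np.array([0.74, 0.69, 0.78, 0.60])
    print(np.corrcoef(id_accs, ood_accs)[0, 1])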