Assaying Out-Of-Distribution Generalization in Transfer Learning

Authors: Florian Wenzel, Andrea Dittadi, Peter Gehler, Carl-Johann Simon-Gabriel, Max Horn, Dominik Zietlow, David Kernert, Chris Russell, Thomas Brox, Bernt Schiele, Bernhard Schölkopf, Francesco Locatello

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We fine-tune over 31k networks, from nine different architectures, in the many- and few-shot setting. Our findings confirm that in- and out-of-distribution accuracies tend to increase jointly, but show that their relation is largely dataset-dependent, and in general more nuanced and more complex than posited by previous, smaller-scale studies. (...) From 36 existing datasets, we extract 172 in-distribution (ID) and out-of-distribution (OOD) dataset pairs, fine-tuning and evaluating over 31k models to gain a broader insight in the sometimes contradicting statements on OOD robustness in previous research. (An illustrative ID-vs-OOD evaluation sketch is given below the table.)
Researcher Affiliation | Collaboration | Florian Wenzel (1), Andrea Dittadi (2), Peter Gehler (1) (...) Affiliations: 1 = AWS Tübingen, 2 = Technical University of Denmark
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm'.
Open Source Code | Yes | The code for the evaluation study is at github.com/amazon-research/assaying-ood.
Open Datasets | Yes | We evaluate nine state-of-the-art deep learning models with publicly available pre-trained weights for ImageNet1k / ILSVRC2012 [24]. We consider 36 datasets grouped into ten different tasks sharing the same labels. (...) We extract 172 (ID, OOD) dataset pairs from the different domains of the ten tasks: DomainNet [83], PACS [84], SVIRO [85], Terra Incognita [13], as well as the Caltech101 [86], VLCS [87], SUN09 [88], VOC2007 [89], and the WILDS datasets [90] (from which we extract two tasks).
Dataset Splits | Yes | For each task, we take a single training dataset to fine-tune the model and report evaluation metrics on both its ID test set and all the other OOD test sets. (...) For the datasets from the WILDS benchmark, we use the provided ID test and OOD test splits.
Hardware Specification | Yes | See Appendix I.3; 17 GPU-years on NVIDIA T4 GPUs (cloud hosted).
Software Dependencies | No | Weights for the pre-trained models were taken from the PyTorch Image Models repository [100]. (A model-loading sketch is given below the table.)
Experiment Setup | Yes | Models are fine-tuned on a single GPU using Adam [91] with a batch size of 64 and a constant learning rate. (...) For each model we consider the learning rate and the number of fine-tuning epochs. We first ran a large sweep over these two hyperparameters on a subset of the experiments and used it to pre-select a set of four parameter combinations that included the best-performing models for each architecture. Additionally, we study three different augmentation strategies: standard ImageNet augmentation (i.e., no additional augmentation), RandAugment [101], and AugMix [28]. (A hedged fine-tuning sketch is given below the table.)
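
To make the software-dependency row concrete, here is a minimal sketch of pulling ImageNet-pretrained weights from the PyTorch Image Models (timm) repository cited above. The architecture name and the size of the new classification head are illustrative assumptions, not values taken from the paper.

    # Minimal sketch: ImageNet-pretrained weights via timm (PyTorch Image Models).
    # The architecture name and num_classes are illustrative assumptions.
    import timm
    import torch

    # Pre-trained backbone with a fresh classification head for the downstream task.
    model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=10)

    # Sanity check with a dummy batch.
    x = torch.randn(2, 3, 224, 224)
    print(model(x).shape)  # torch.Size([2, 10])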
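
The experiment-setup row maps onto a standard PyTorch fine-tuning loop. The sketch below assumes Adam, a batch size of 64, a constant learning rate, and RandAugment as the extra augmentation (AugMix or no extra augmentation are the other strategies mentioned); the dataset path, architecture, learning rate, and epoch count are placeholders rather than the paper's pre-selected hyperparameter combinations.

    # Hedged fine-tuning sketch: Adam, batch size 64, constant learning rate,
    # optional extra augmentation (RandAugment shown; AugMix or none are alternatives).
    # Dataset path, architecture, learning rate, and epoch count are placeholders.
    import timm
    import torch
    from torch import nn
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    train_tf = transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.RandAugment(),        # or transforms.AugMix(), or omit for the baseline
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    train_set = datasets.ImageFolder("path/to/id_train", transform=train_tf)  # placeholder path
    loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = timm.create_model("resnet50", pretrained=True,
                              num_classes=len(train_set.classes)).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # constant learning rate, no scheduler
    criterion = nn.CrossEntropyLoss()

    for epoch in range(10):  # placeholder epoch count
        model.train()
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()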
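
Finally, the ID-vs-OOD relation referenced in the research-type row can be made concrete with a small evaluation helper. The sketch below computes test accuracy on an ID and an OOD split for a fine-tuned model and then correlates ID against OOD accuracy across a collection of models; the plain Pearson correlation (np.corrcoef) and the accuracy values shown are generic placeholders, not the paper's exact statistics.

    # Sketch: accuracy on ID and OOD test splits, then the ID/OOD accuracy relation
    # across many fine-tuned models. Correlation choice and values are placeholders.
    import numpy as np
    import torch

    @torch.no_grad()
    def accuracy(model, loader, device="cuda"):
        model.eval()
        correct = total = 0
        for images, labels in loader:
            preds = model(images.to(device)).argmax(dim=1).cpu()
            correct += (preds == labels).sum().item()
            total += labels.numel()
        return correct / total

    # For one (ID, OOD) dataset pair and one fine-tuned model:
    #   id_acc  = accuracy(model, id_test_loader)
    #   ood_acc = accuracy(model, ood_test_loader)

    # Across many fine-tuned models on the same (ID, OOD) pair, the relation can be
    # summarised by a correlation coefficient (values below are placeholders):
    id_accs = np.array([0.91, 0.87, 0.93, 0.82])
    ood_accs = np.array([0.74, 0.69, 0.78, 0.60])
    print(np.corrcoef(id_accs, ood_accs)[0, 1])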