WILDS: A Benchmark of in-the-Wild Distribution Shifts

Authors: Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, Tony Lee, Etienne David, Ian Stavness, Wei Guo, Berton Earnshaw, Imran Haque, Sara M Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, Percy Liang

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We ascertained this for each dataset by training standard models using empirical risk minimization (ERM), i.e., minimizing the average training loss, and then comparing their out-of-distribution (OOD) vs. in-distribution (ID) performance. The OOD setting is captured by the default train/test split and the evaluation criteria described in Section 4... Table 1 shows that for each dataset, OOD performance is consistently and substantially lower than ID performance." (A minimal ERM ID-vs-OOD sketch follows this table.)
Researcher Affiliation | Collaboration | 1 Stanford; 2 UC Berkeley; 3 Cornell; 4 INRAE; 5 USask; 6 UTokyo; 7 Recursion; 8 Caltech; 9 Microsoft Research.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | "To facilitate method development, we provide an open-source package that automates dataset loading, contains default model architectures and hyperparameters, and standardizes evaluations. The full paper, code, and leaderboards are available at https://wilds.stanford.edu." (A data-loading sketch using this package follows this table.)
Open Datasets | Yes | "To address this gap, we present WILDS, a curated benchmark of 10 datasets reflecting a diverse range of distribution shifts... The full paper, code, and leaderboards are available at https://wilds.stanford.edu."
Dataset Splits | Yes | "The OOD setting is captured by the default train/test split and the evaluation criteria described in Section 4... We strongly encourage all model developers to use the provided OOD validation sets for development and model selection, and to only use the OOD test sets for their final evaluations."
Hardware Specification | No | The paper states that an executable version of the paper is hosted on CodaLab, which includes the environment used for the experiments, but it does not provide specific hardware details such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper mentions that the "environment" used for experiments is available on CodaLab and that an "open-source Python package" is provided, but it does not list specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9).
Experiment Setup | Yes | "To facilitate method development, we provide an open-source package that automates dataset loading, contains default model architectures and hyperparameters, and standardizes evaluations... More experimental details are in Appendix G, and dataset-specific hyperparameters and domain choices are discussed in Appendix H."
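
The package referenced in the Open Source Code, Open Datasets, and Dataset Splits rows is the `wilds` Python package documented at https://wilds.stanford.edu. The sketch below follows that package's documented usage for downloading a benchmark dataset and loading its default splits; the dataset choice ("iwildcam"), image size, and batch size are illustrative assumptions, and the exact API may differ across package versions.

```python
# Sketch of loading a WILDS dataset and its default splits with the `wilds`
# package (pip install wilds). Dataset, transform, and batch size are
# illustrative assumptions, not values prescribed by the paper.
import torchvision.transforms as transforms
from wilds import get_dataset
from wilds.common.data_loaders import get_train_loader, get_eval_loader

transform = transforms.Compose(
    [transforms.Resize((448, 448)), transforms.ToTensor()]
)

# Download and load one of the benchmark datasets.
dataset = get_dataset(dataset="iwildcam", download=True)

# Default splits: "train", plus the OOD "val"/"test" splits used for model
# selection and final evaluation; some datasets also provide ID splits
# (e.g., "id_val"/"id_test") for the ID-vs-OOD comparison.
train_data = dataset.get_subset("train", transform=transform)
ood_val_data = dataset.get_subset("val", transform=transform)    # OOD validation
ood_test_data = dataset.get_subset("test", transform=transform)  # OOD test (final eval only)

train_loader = get_train_loader("standard", train_data, batch_size=16)
ood_test_loader = get_eval_loader("standard", ood_test_data, batch_size=16)

# Each batch yields (x, y, metadata); the metadata carries domain annotations
# (e.g., camera trap ID) used by the package's standardized evaluation.
for x, y_true, metadata in train_loader:
    break
```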
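The Research Type row quotes the paper's baseline protocol: train a standard model with ERM, i.e., minimize the average training loss, then compare ID and OOD performance. The sketch below is a minimal PyTorch illustration of that protocol, not the authors' reference implementation; `model` and the data loaders are assumed to come from the loading sketch above, and the hyperparameters are arbitrary placeholders.

```python
# Minimal ERM sketch (not the authors' reference code): minimize the average
# training loss, then compare ID vs. OOD accuracy. Assumes `model` and the
# loaders from the loading sketch above; hyperparameters are illustrative.
import torch
import torch.nn.functional as F

def train_erm(model, train_loader, epochs=1, lr=1e-3, device="cpu"):
    """Standard ERM: average cross-entropy over the training set, ignoring domains."""
    model.to(device).train()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y, _metadata in train_loader:  # WILDS loaders yield (x, y, metadata)
            x, y = x.to(device), y.to(device)
            loss = F.cross_entropy(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model

@torch.no_grad()
def accuracy(model, loader, device="cpu"):
    """Plain average accuracy over a loader (the paper also reports per-domain metrics)."""
    model.to(device).eval()
    correct, total = 0, 0
    for x, y, _metadata in loader:
        preds = model(x.to(device)).argmax(dim=1).cpu()
        correct += (preds == y).sum().item()
        total += y.numel()
    return correct / total

# The ID-vs-OOD gap discussed in the Research Type row corresponds to comparing:
#   id_acc  = accuracy(model, id_test_loader)   # in-distribution (where an id_test split exists)
#   ood_acc = accuracy(model, ood_test_loader)  # out-of-distribution (default test split)
```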