WILDS: A Benchmark of in-the-Wild Distribution Shifts

Authors: Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, Tony Lee, Etienne David, Ian Stavness, Wei Guo, Berton Earnshaw, Imran Haque, Sara M Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, Percy Liang

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We ascertained this for each dataset by training standard models using empirical risk minimization (ERM), i.e., minimizing the average training loss, and then comparing their out-of-distribution (OOD) vs. in-distribution (ID) performance. The OOD setting is captured by the default train/test split and the evaluation criteria described in Section 4... Table 1 shows that for each dataset, OOD performance is consistently and substantially lower than ID performance." (A minimal ERM ID-vs-OOD sketch follows this table.)
Researcher Affiliation | Collaboration | 1 Stanford; 2 UC Berkeley; 3 Cornell; 4 INRAE; 5 USask; 6 UTokyo; 7 Recursion; 8 Caltech; 9 Microsoft Research.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | "To facilitate method development, we provide an open-source package that automates dataset loading, contains default model architectures and hyperparameters, and standardizes evaluations. The full paper, code, and leaderboards are available at https://wilds.stanford.edu." (A data-loading sketch using this package follows this table.)
Open Datasets | Yes | "To address this gap, we present WILDS, a curated benchmark of 10 datasets reflecting a diverse range of distribution shifts... The full paper, code, and leaderboards are available at https://wilds.stanford.edu."
Dataset Splits | Yes | "The OOD setting is captured by the default train/test split and the evaluation criteria described in Section 4... We strongly encourage all model developers to use the provided OOD validation sets for development and model selection, and to only use the OOD test sets for their final evaluations."
Hardware Specification | No | The paper states that an executable version of the paper is hosted on CodaLab, which includes the environment used for the experiments, but it does not provide specific hardware details such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper mentions that the "environment" used for experiments is available on CodaLab and that an "open-source Python package" is provided, but it does not list specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9).
Experiment Setup | Yes | "To facilitate method development, we provide an open-source package that automates dataset loading, contains default model architectures and hyperparameters, and standardizes evaluations... More experimental details are in Appendix G, and dataset-specific hyperparameters and domain choices are discussed in Appendix H."
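
The package referenced in the Open Source Code, Open Datasets, and Dataset Splits rows is the `wilds` Python package documented at https://wilds.stanford.edu. The sketch below follows that package's documented usage for downloading a benchmark dataset and loading its default splits; the dataset choice ("iwildcam"), image size, and batch size are illustrative assumptions, and the exact API may differ across package versions.

```python
# Sketch of loading a WILDS dataset and its default splits with the `wilds`
# package (pip install wilds). Dataset, transform, and batch size are
# illustrative assumptions, not values prescribed by the paper.
import torchvision.transforms as transforms
from wilds import get_dataset
from wilds.common.data_loaders import get_train_loader, get_eval_loader

transform = transforms.Compose(
    [transforms.Resize((448, 448)), transforms.ToTensor()]
)

# Download and load one of the benchmark datasets.
dataset = get_dataset(dataset="iwildcam", download=True)

# Default splits: "train", plus the OOD "val"/"test" splits used for model
# selection and final evaluation; some datasets also provide ID splits
# (e.g., "id_val"/"id_test") for the ID-vs-OOD comparison.
train_data = dataset.get_subset("train", transform=transform)
ood_val_data = dataset.get_subset("val", transform=transform)    # OOD validation
ood_test_data = dataset.get_subset("test", transform=transform)  # OOD test (final eval only)

train_loader = get_train_loader("standard", train_data, batch_size=16)
ood_test_loader = get_eval_loader("standard", ood_test_data, batch_size=16)

# Each batch yields (x, y, metadata); the metadata carries domain annotations
# (e.g., camera trap ID) used by the package's standardized evaluation.
for x, y_true, metadata in train_loader:
    break
```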
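The Research Type row quotes the paper's baseline protocol: train a standard model with ERM, i.e., minimize the average training loss, then compare ID and OOD performance. The sketch below is a minimal PyTorch illustration of that protocol, not the authors' reference implementation; `model` and the data loaders are assumed to come from the loading sketch above, and the hyperparameters are arbitrary placeholders.

```python
# Minimal ERM sketch (not the authors' reference code): minimize the average
# training loss, then compare ID vs. OOD accuracy. Assumes `model` and the
# loaders from the loading sketch above; hyperparameters are illustrative.
import torch
import torch.nn.functional as F

def train_erm(model, train_loader, epochs=1, lr=1e-3, device="cpu"):
    """Standard ERM: average cross-entropy over the training set, ignoring domains."""
    model.to(device).train()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y, _metadata in train_loader:  # WILDS loaders yield (x, y, metadata)
            x, y = x.to(device), y.to(device)
            loss = F.cross_entropy(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model

@torch.no_grad()
def accuracy(model, loader, device="cpu"):
    """Plain average accuracy over a loader (the paper also reports per-domain metrics)."""
    model.to(device).eval()
    correct, total = 0, 0
    for x, y, _metadata in loader:
        preds = model(x.to(device)).argmax(dim=1).cpu()
        correct += (preds == y).sum().item()
        total += y.numel()
    return correct / total

# The ID-vs-OOD gap discussed in the Research Type row corresponds to comparing:
#   id_acc  = accuracy(model, id_test_loader)   # in-distribution (where an id_test split exists)
#   ood_acc = accuracy(model, ood_test_loader)  # out-of-distribution (default test split)
```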