Extending the WILDS Benchmark for Unsupervised Adaptation
Authors: Shiori Sagawa, Pang Wei Koh, Tony Lee, Irena Gao, Sang Michael Xie, Kendrick Shen, Ananya Kumar, Weihua Hu, Michihiro Yasunaga, Henrik Marklund, Sara Beery, Etienne David, Ian Stavness, Wei Guo, Jure Leskovec, Kate Saenko, Tatsunori Hashimoto, Sergey Levine, Chelsea Finn, Percy Liang
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we present the WILDS 2.0 update, which extends 8 of the 10 datasets in the WILDS benchmark of distribution shifts to include curated unlabeled data that would be realistically obtainable in deployment. These datasets span a wide range of applications (from histology to wildlife conservation), tasks (classification, regression, and detection), and modalities (photos, satellite images, microscope slides, text, molecular graphs). The update maintains consistency with the original WILDS benchmark by using identical labeled training, validation, and test sets, as well as identical evaluation metrics. We systematically benchmark state-of-the-art methods that use unlabeled data, including domain-invariant, self-training, and self-supervised methods, and show that their success on WILDS is limited. |
| Researcher Affiliation | Academia | (1) Stanford University, (2) Caltech, (3) INRAE, (4) University of Saskatchewan, (5) University of Tokyo, (6) Boston University, (7) University of California, Berkeley |
| Pseudocode | Yes | Algorithm 1: CORAL; Algorithm 2: DANN; Algorithm 3: Pseudo-Label; Algorithm 4: FixMatch; Algorithm 5: Noisy Student (a minimal pseudo-labeling sketch follows the table) |
| Open Source Code | Yes | To this end, we have updated the open-source Python WILDS package to include unlabeled data loaders, compatible implementations of all the methods we benchmarked, and scripts to replicate all experiments in this paper (Appendix G). Code and leaderboards are available at https://wilds.stanford.edu. (A data-loading sketch follows the table.) |
| Open Datasets | Yes | All WILDS datasets are publicly available at https://wilds.stanford.edu, together with code and scripts to replicate all of the experiments in this paper. |
| Dataset Splits | Yes | Table 1: All datasets have labeled source, validation, and target data, as well as unlabeled data from one or more types of domains, depending on what is realistic for the application. ... Following WILDS 1.0, we used the labeled out-of-distribution (OOD) validation set to select hyperparameters and for early stopping (Koh et al., 2021). |
| Hardware Specification | Yes | Overall, we ran 600+ experiments for 7,000 GPU hours on NVIDIA V100s. ... We ran experiments on a mix of NVIDIA GPUs: V100, K80, GeForce RTX, Titan RTX, Titan Xp, and Titan V. |
| Software Dependencies | No | The paper mentions software like "Weights and Biases platform (Biewald, 2020)", "DistilBERT (Sanh et al., 2019)", "BERT implementation (Devlin et al., 2019)", and a "public SwAV repository", but it does not specify explicit version numbers for these software dependencies (e.g., PyTorch 1.x, TensorFlow 2.x, or specific library versions). |
| Experiment Setup | Yes | Hyperparameters. We tuned each method on each dataset separately using random hyperparameter search. Following WILDS 1.0, we used the labeled out-of-distribution (OOD) validation set to select hyperparameters and for early stopping (Koh et al., 2021). ... Appendix D for further experimental details. (A random-search sketch follows the table.) |
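The Pseudocode row lists the unlabeled-data methods benchmarked in the paper (Algorithms 1-5). As a rough illustration of the simplest of these, the following is a minimal pseudo-labeling (self-training) training step in PyTorch; the model, optimizer, batch format, and confidence threshold are placeholders, and the sketch does not reproduce the paper's exact Algorithm 3 (e.g., its scheduling or loss weighting).

```python
import torch
import torch.nn.functional as F

def pseudo_label_step(model, optimizer, labeled_batch, unlabeled_batch, threshold=0.8):
    """One pseudo-labeling step: supervised loss on labeled data plus a
    cross-entropy loss on unlabeled data against the model's own confident
    predictions. Threshold and loss weighting are illustrative only."""
    x_l, y_l = labeled_batch
    x_u = unlabeled_batch

    # Standard supervised loss on labeled source examples.
    logits_l = model(x_l)
    loss = F.cross_entropy(logits_l, y_l)

    # Assign pseudo-labels to unlabeled examples (no gradient through targets).
    with torch.no_grad():
        probs_u = F.softmax(model(x_u), dim=-1)
        conf, pseudo_y = probs_u.max(dim=-1)
        keep = conf >= threshold  # retain only confident predictions

    if keep.any():
        logits_u = model(x_u[keep])
        loss = loss + F.cross_entropy(logits_u, pseudo_y[keep])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```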
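The Open Source Code row points to the updated WILDS package. The sketch below shows how labeled and unlabeled subsets are typically loaded with that package; the `unlabeled=True` flag and the `"extra_unlabeled"` split name follow the public WILDS 2.x README for iWildCam and may differ for other datasets or later package versions.

```python
import torchvision.transforms as transforms
from wilds import get_dataset
from wilds.common.data_loaders import get_train_loader

transform = transforms.Compose([transforms.Resize((448, 448)), transforms.ToTensor()])

# Labeled data: identical train/validation/test splits to the original WILDS benchmark.
labeled_dataset = get_dataset(dataset="iwildcam", download=True)
train_data = labeled_dataset.get_subset("train", transform=transform)
train_loader = get_train_loader("standard", train_data, batch_size=16)

# Curated unlabeled data added in the WILDS 2.0 update.
unlabeled_dataset = get_dataset(dataset="iwildcam", unlabeled=True, download=True)
unlabeled_data = unlabeled_dataset.get_subset("extra_unlabeled", transform=transform)
unlabeled_loader = get_train_loader("standard", unlabeled_data, batch_size=16)
```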
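Finally, the Experiment Setup row describes random hyperparameter search with model selection on the labeled OOD validation set. A minimal sketch of that selection loop is below; the search space, trial count, and `train_and_eval` routine are hypothetical placeholders rather than the paper's actual grids (see its Appendix D).

```python
import random

# Hypothetical search space; the paper tunes method- and dataset-specific
# hyperparameters (Appendix D), not these particular ranges.
SEARCH_SPACE = {
    "lr": lambda: 10 ** random.uniform(-5, -3),
    "weight_decay": lambda: 10 ** random.uniform(-5, -2),
    "unlabeled_loss_weight": lambda: random.choice([0.1, 0.5, 1.0]),
}

def random_search(train_and_eval, n_trials=10, seed=0):
    """Sample configurations at random and keep the one with the best
    out-of-distribution (OOD) validation metric."""
    random.seed(seed)
    best_config, best_metric = None, float("-inf")
    for _ in range(n_trials):
        config = {name: sample() for name, sample in SEARCH_SPACE.items()}
        ood_val_metric = train_and_eval(config)  # user-supplied training routine
        if ood_val_metric > best_metric:
            best_config, best_metric = config, ood_val_metric
    return best_config, best_metric
```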