Extending the WILDS Benchmark for Unsupervised Adaptation
Authors: Shiori Sagawa, Pang Wei Koh, Tony Lee, Irena Gao, Sang Michael Xie, Kendrick Shen, Ananya Kumar, Weihua Hu, Michihiro Yasunaga, Henrik Marklund, Sara Beery, Etienne David, Ian Stavness, Wei Guo, Jure Leskovec, Kate Saenko, Tatsunori Hashimoto, Sergey Levine, Chelsea Finn, Percy Liang
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we present the WILDS 2.0 update, which extends 8 of the 10 datasets in the WILDS benchmark of distribution shifts to include curated unlabeled data that would be realistically obtainable in deployment. These datasets span a wide range of applications (from histology to wildlife conservation), tasks (classification, regression, and detection), and modalities (photos, satellite images, microscope slides, text, molecular graphs). The update maintains consistency with the original WILDS benchmark by using identical labeled training, validation, and test sets, as well as identical evaluation metrics. We systematically benchmark state-of-the-art methods that use unlabeled data, including domain-invariant, self-training, and self-supervised methods, and show that their success on WILDS is limited. |
| Researcher Affiliation | Academia | (1) Stanford University, (2) Caltech, (3) INRAE, (4) University of Saskatchewan, (5) University of Tokyo, (6) Boston University, (7) University of California, Berkeley |
| Pseudocode | Yes | Algorithm 1: CORAL; Algorithm 2: DANN; Algorithm 3: Pseudo-Label; Algorithm 4: FixMatch; Algorithm 5: Noisy Student (a minimal pseudo-labeling sketch follows the table) |
| Open Source Code | Yes | To this end, we have updated the open-source Python WILDS package to include unlabeled data loaders, compatible implementations of all the methods we benchmarked, and scripts to replicate all experiments in this paper (Appendix G). Code and leaderboards are available at https://wilds.stanford.edu. (A data-loading sketch follows the table.) |
| Open Datasets | Yes | All WILDS datasets are publicly available at https://wilds.stanford.edu, together with code and scripts to replicate all of the experiments in this paper. |
| Dataset Splits | Yes | Table 1: All datasets have labeled source, validation, and target data, as well as unlabeled data from one or more types of domains, depending on what is realistic for the application. ... Following WILDS 1.0, we used the labeled out-of-distribution (OOD) validation set to select hyperparameters and for early stopping (Koh et al., 2021). |
| Hardware Specification | Yes | Overall, we ran 600+ experiments for 7,000 GPU hours on NVIDIA V100s. ... We ran experiments on a mix of NVIDIA GPUs: V100, K80, GeForce RTX, Titan RTX, Titan Xp, and Titan V. |
| Software Dependencies | No | The paper mentions software like "Weights and Biases platform (Biewald, 2020)", "DistilBERT (Sanh et al., 2019)", "BERT implementation (Devlin et al., 2019)", and a "public SwAV repository", but it does not specify explicit version numbers for these software dependencies (e.g., PyTorch 1.x, TensorFlow 2.x, or specific library versions). |
| Experiment Setup | Yes | Hyperparameters. We tuned each method on each dataset separately using random hyperparameter search. Following WILDS 1.0, we used the labeled out-of-distribution (OOD) validation set to select hyperparameters and for early stopping (Koh et al., 2021). ... Appendix D for further experimental details. (A random-search sketch follows the table.) |
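The Pseudocode row lists the unlabeled-data methods benchmarked in the paper (Algorithms 1-5). As a rough illustration of the simplest of these, the following is a minimal pseudo-labeling (self-training) training step in PyTorch; the model, optimizer, batch format, and confidence threshold are placeholders, and the sketch does not reproduce the paper's exact Algorithm 3 (e.g., its scheduling or loss weighting).

```python
import torch
import torch.nn.functional as F

def pseudo_label_step(model, optimizer, labeled_batch, unlabeled_batch, threshold=0.8):
    """One pseudo-labeling step: supervised loss on labeled data plus a
    cross-entropy loss on unlabeled data against the model's own confident
    predictions. Threshold and loss weighting are illustrative only."""
    x_l, y_l = labeled_batch
    x_u = unlabeled_batch

    # Standard supervised loss on labeled source examples.
    logits_l = model(x_l)
    loss = F.cross_entropy(logits_l, y_l)

    # Assign pseudo-labels to unlabeled examples (no gradient through targets).
    with torch.no_grad():
        probs_u = F.softmax(model(x_u), dim=-1)
        conf, pseudo_y = probs_u.max(dim=-1)
        keep = conf >= threshold  # retain only confident predictions

    if keep.any():
        logits_u = model(x_u[keep])
        loss = loss + F.cross_entropy(logits_u, pseudo_y[keep])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```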
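The Open Source Code row points to the updated WILDS package. The sketch below shows how labeled and unlabeled subsets are typically loaded with that package; the `unlabeled=True` flag and the `"extra_unlabeled"` split name follow the public WILDS 2.x README for iWildCam and may differ for other datasets or later package versions.

```python
import torchvision.transforms as transforms
from wilds import get_dataset
from wilds.common.data_loaders import get_train_loader

transform = transforms.Compose([transforms.Resize((448, 448)), transforms.ToTensor()])

# Labeled data: identical train/validation/test splits to the original WILDS benchmark.
labeled_dataset = get_dataset(dataset="iwildcam", download=True)
train_data = labeled_dataset.get_subset("train", transform=transform)
train_loader = get_train_loader("standard", train_data, batch_size=16)

# Curated unlabeled data added in the WILDS 2.0 update.
unlabeled_dataset = get_dataset(dataset="iwildcam", unlabeled=True, download=True)
unlabeled_data = unlabeled_dataset.get_subset("extra_unlabeled", transform=transform)
unlabeled_loader = get_train_loader("standard", unlabeled_data, batch_size=16)
```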
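Finally, the Experiment Setup row describes random hyperparameter search with model selection on the labeled OOD validation set. A minimal sketch of that selection loop is below; the search space, trial count, and `train_and_eval` routine are hypothetical placeholders rather than the paper's actual grids (see its Appendix D).

```python
import random

# Hypothetical search space; the paper tunes method- and dataset-specific
# hyperparameters (Appendix D), not these particular ranges.
SEARCH_SPACE = {
    "lr": lambda: 10 ** random.uniform(-5, -3),
    "weight_decay": lambda: 10 ** random.uniform(-5, -2),
    "unlabeled_loss_weight": lambda: random.choice([0.1, 0.5, 1.0]),
}

def random_search(train_and_eval, n_trials=10, seed=0):
    """Sample configurations at random and keep the one with the best
    out-of-distribution (OOD) validation metric."""
    random.seed(seed)
    best_config, best_metric = None, float("-inf")
    for _ in range(n_trials):
        config = {name: sample() for name, sample in SEARCH_SPACE.items()}
        ood_val_metric = train_and_eval(config)  # user-supplied training routine
        if ood_val_metric > best_metric:
            best_config, best_metric = config, ood_val_metric
    return best_config, best_metric
```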