Discovering Environments with XRM

Authors: Mohammad Pezeshki, Diane Bouchacourt, Mark Ibrahim, Nicolas Ballas, Pascal Vincent, David Lopez-Paz

ICML 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This section presents a series of experiments to showcase the effectiveness of XRM on two well-known benchmarks. Additional experiments are also conducted to identify scenarios where XRM excels, as well as scenarios where it fails to discover relevant environments. |
| Researcher Affiliation | Collaboration | ¹FAIR at Meta, ²Mila at Université de Montréal, ³CIFAR. Correspondence to: Mohammad Pezeshki <mpezeshki@meta.com>. |
| Pseudocode | Yes | Algorithm 1: CROSS-RISK MINIMIZATION (XRM) (see the sketch after the table). |
| Open Source Code | Yes | Code available at https://github.com/facebookresearch/XRM. |
| Open Datasets | Yes | For sub-population shift tasks, the paper experiments with seven datasets and four algorithms detailed in Appendix B, comparing results against three sources of environment annotations. The datasets are Waterbirds (Wah et al., 2011), CelebA (Liu et al., 2015), MultiNLI (Williams et al., 2017), CivilComments (Borkan et al., 2019), MetaShift (Liang & Zou, 2022), ImageNet BG (Xiao et al., 2020), and ColorMNIST (Arjovsky et al., 2019). |
| Dataset Splits | Yes | For model selection, we adhere to the standard practice of using the worst-group accuracy. We try 10 different hyper-parameter combinations detailed in Appendix B.7 with one random seed. We select the hyper-parameter combination and early-stopping iteration yielding maximal validation worst-group accuracy (or, in the absence of groups, worst-class accuracy). See the model-selection sketch after the table. |
| Hardware Specification | Yes | We found that, on a Volta-32GB GPU, running the Learning to Split group-inference module took approximately 20 hours. |
| Software Dependencies | No | The paper mentions software such as PyTorch (implied by the code snippet), SGD, AdamW, ResNet-50, and BERT, but does not specify version numbers for these components. |
| Experiment Setup | Yes | All images are resized and center-cropped to 224 × 224 pixels and undergo no data augmentation. We use SGD with momentum 0.9 to learn from image datasets unless otherwise mentioned, and we employ AdamW (Loshchilov & Hutter, 2017) with default β1 = 0.9 and β2 = 0.999 for text benchmarks. See the setup sketch after the table. |
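The pseudocode row refers to Algorithm 1 (Cross-Risk Minimization). Below is a minimal, non-authoritative sketch of the core cross-mistake idea as described at a high level: two twin classifiers are each trained on one random half of the training data, and each example is then assigned to an environment according to whether the twin that did not train on it misclassifies it. The sketch omits the paper's held-in label-flipping component, and the function and variable names (`train_twin`, `discover_environments`, `holdout_mask`) are illustrative choices, not the authors' API; consult the official repository for the actual implementation.

```python
# Illustrative sketch of XRM-style environment discovery (not the official code).
# Assumption: two "twin" classifiers are each trained on one random half of the
# training set, and an example's environment is defined by whether the twin that
# did NOT train on it (the "cross" model) gets its label wrong.
import torch
from torch import nn


def train_twin(model, loader, epochs=10, lr=1e-3):
    """Standard ERM training of one twin on its half of the data."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model


@torch.no_grad()
def discover_environments(model_a, model_b, dataset, holdout_mask):
    """Assign each example to environment 0 or 1 via cross-mistakes.

    holdout_mask[i] is True if example i was held out from model_a
    (i.e. it was used to train model_b), and vice versa.
    """
    envs = []
    for i in range(len(dataset)):
        x, y = dataset[i]
        # The "cross" model is the twin that never saw example i during training.
        cross_model = model_a if holdout_mask[i] else model_b
        pred = cross_model(x.unsqueeze(0)).argmax(dim=1).item()
        # Environment 1: the held-out twin misclassifies the example,
        # suggesting its label disagrees with the learned shortcut.
        envs.append(int(pred != y))
    return torch.tensor(envs)
```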
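The dataset-splits row quotes the paper's model-selection protocol: choose the hyper-parameter combination and early-stopping iteration with the highest validation worst-group accuracy. A small sketch of that metric is shown below; `worst_group_accuracy` is a hypothetical helper of ours, not a function from the paper's codebase.

```python
import numpy as np


def worst_group_accuracy(preds, labels, groups):
    """Worst-group accuracy: the minimum per-group accuracy over all groups.

    preds, labels, groups are 1-D integer arrays of equal length; `groups`
    holds one group id per example (e.g. class x spurious-attribute combinations).
    """
    preds, labels, groups = map(np.asarray, (preds, labels, groups))
    accs = []
    for g in np.unique(groups):
        mask = groups == g
        accs.append((preds[mask] == labels[mask]).mean())
    return min(accs)


# Model selection: keep the (hyper-parameters, checkpoint) pair whose
# validation worst-group accuracy is highest, e.g.
# best = max(candidates, key=lambda c: worst_group_accuracy(c.val_preds, val_y, val_g))
```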
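The experiment-setup row describes the preprocessing and optimizer choices. A hedged sketch of how this could be wired up with PyTorch and torchvision follows; the intermediate resize size, learning rates, and weight decay are not stated in the quoted text and are left as assumptions or placeholders (see Appendix B of the paper for the actual values).

```python
import torch
from torchvision import transforms

# Images: resize + center-crop to 224x224, no data augmentation (per the paper).
# Resizing the shorter side to 224 before cropping is a common interpretation,
# not a detail confirmed by the quoted setup.
image_transform = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])


def make_optimizer(model, modality, lr, weight_decay=0.0):
    """SGD with momentum 0.9 for image datasets, AdamW for text benchmarks."""
    if modality == "image":
        return torch.optim.SGD(model.parameters(), lr=lr,
                               momentum=0.9, weight_decay=weight_decay)
    # AdamW with default betas=(0.9, 0.999), matching the quoted setup.
    return torch.optim.AdamW(model.parameters(), lr=lr,
                             weight_decay=weight_decay, betas=(0.9, 0.999))
```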