Discovering Environments with XRM

Authors: Mohammad Pezeshki, Diane Bouchacourt, Mark Ibrahim, Nicolas Ballas, Pascal Vincent, David Lopez-Paz

ICML 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This section presents a series of experiments to showcase the effectiveness of XRM on two well-known benchmarks. Additional experiments are also conducted to identify scenarios where XRM excels, as well as scenarios where it fails to discover relevant environments. |
| Researcher Affiliation | Collaboration | ¹FAIR at Meta, ²Mila at Université de Montréal, ³CIFAR. Correspondence to: Mohammad Pezeshki <mpezeshki@meta.com>. |
| Pseudocode | Yes | Algorithm 1: CROSS-RISK MINIMIZATION (XRM) (see the sketch after the table). |
| Open Source Code | Yes | Code available at https://github.com/facebookresearch/XRM. |
| Open Datasets | Yes | For sub-population shift tasks, the paper experiments with seven datasets and four algorithms detailed in Appendix B, comparing results against three sources of environment annotations. The datasets are Waterbirds (Wah et al., 2011), CelebA (Liu et al., 2015), MultiNLI (Williams et al., 2017), CivilComments (Borkan et al., 2019), MetaShift (Liang & Zou, 2022), ImageNet BG (Xiao et al., 2020), and ColorMNIST (Arjovsky et al., 2019). |
| Dataset Splits | Yes | For model selection, we adhere to the standard practice of using the worst-group accuracy. We try 10 different hyper-parameter combinations detailed in Appendix B.7 with one random seed. We select the hyper-parameter combination and early-stopping iteration yielding maximal validation worst-group accuracy (or, in the absence of groups, worst-class accuracy). See the model-selection sketch after the table. |
| Hardware Specification | Yes | We found that, on a Volta-32GB GPU, running the Learning to Split group-inference module took approximately 20 hours. |
| Software Dependencies | No | The paper mentions software such as PyTorch (implied by the code snippet), SGD, AdamW, ResNet-50, and BERT, but does not specify version numbers for these components. |
| Experiment Setup | Yes | All images are resized and center-cropped to 224 × 224 pixels and undergo no data augmentation. We use SGD with momentum 0.9 to learn from image datasets unless otherwise mentioned, and we employ AdamW (Loshchilov & Hutter, 2017) with default β1 = 0.9 and β2 = 0.999 for text benchmarks. See the setup sketch after the table. |
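The pseudocode row refers to Algorithm 1 (Cross-Risk Minimization). Below is a minimal, non-authoritative sketch of the core cross-mistake idea as described at a high level: two twin classifiers are each trained on one random half of the training data, and each example is then assigned to an environment according to whether the twin that did not train on it misclassifies it. The sketch omits the paper's held-in label-flipping component, and the function and variable names (`train_twin`, `discover_environments`, `holdout_mask`) are illustrative choices, not the authors' API; consult the official repository for the actual implementation.

```python
# Illustrative sketch of XRM-style environment discovery (not the official code).
# Assumption: two "twin" classifiers are each trained on one random half of the
# training set, and an example's environment is defined by whether the twin that
# did NOT train on it (the "cross" model) gets its label wrong.
import torch
from torch import nn


def train_twin(model, loader, epochs=10, lr=1e-3):
    """Standard ERM training of one twin on its half of the data."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model


@torch.no_grad()
def discover_environments(model_a, model_b, dataset, holdout_mask):
    """Assign each example to environment 0 or 1 via cross-mistakes.

    holdout_mask[i] is True if example i was held out from model_a
    (i.e. it was used to train model_b), and vice versa.
    """
    envs = []
    for i in range(len(dataset)):
        x, y = dataset[i]
        # The "cross" model is the twin that never saw example i during training.
        cross_model = model_a if holdout_mask[i] else model_b
        pred = cross_model(x.unsqueeze(0)).argmax(dim=1).item()
        # Environment 1: the held-out twin misclassifies the example,
        # suggesting its label disagrees with the learned shortcut.
        envs.append(int(pred != y))
    return torch.tensor(envs)
```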
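The dataset-splits row quotes the paper's model-selection protocol: choose the hyper-parameter combination and early-stopping iteration with the highest validation worst-group accuracy. A small sketch of that metric is shown below; `worst_group_accuracy` is a hypothetical helper of ours, not a function from the paper's codebase.

```python
import numpy as np


def worst_group_accuracy(preds, labels, groups):
    """Worst-group accuracy: the minimum per-group accuracy over all groups.

    preds, labels, groups are 1-D integer arrays of equal length; `groups`
    holds one group id per example (e.g. class x spurious-attribute combinations).
    """
    preds, labels, groups = map(np.asarray, (preds, labels, groups))
    accs = []
    for g in np.unique(groups):
        mask = groups == g
        accs.append((preds[mask] == labels[mask]).mean())
    return min(accs)


# Model selection: keep the (hyper-parameters, checkpoint) pair whose
# validation worst-group accuracy is highest, e.g.
# best = max(candidates, key=lambda c: worst_group_accuracy(c.val_preds, val_y, val_g))
```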
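The experiment-setup row describes the preprocessing and optimizer choices. A hedged sketch of how this could be wired up with PyTorch and torchvision follows; the intermediate resize size, learning rates, and weight decay are not stated in the quoted text and are left as assumptions or placeholders (see Appendix B of the paper for the actual values).

```python
import torch
from torchvision import transforms

# Images: resize + center-crop to 224x224, no data augmentation (per the paper).
# Resizing the shorter side to 224 before cropping is a common interpretation,
# not a detail confirmed by the quoted setup.
image_transform = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])


def make_optimizer(model, modality, lr, weight_decay=0.0):
    """SGD with momentum 0.9 for image datasets, AdamW for text benchmarks."""
    if modality == "image":
        return torch.optim.SGD(model.parameters(), lr=lr,
                               momentum=0.9, weight_decay=weight_decay)
    # AdamW with default betas=(0.9, 0.999), matching the quoted setup.
    return torch.optim.AdamW(model.parameters(), lr=lr,
                             weight_decay=weight_decay, betas=(0.9, 0.999))
```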