Discovering Environments with XRM
Authors: Mohammad Pezeshki, Diane Bouchacourt, Mark Ibrahim, Nicolas Ballas, Pascal Vincent, David Lopez-Paz
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This section presents a series of experiments to showcase the effectiveness of XRM on two well-known benchmarks. Additional experiments are also conducted to identify scenarios where XRM excels, as well as scenarios where it fails to discover relevant environments. |
| Researcher Affiliation | Collaboration | ¹FAIR at Meta, ²Mila at Université de Montréal, ³CIFAR. Correspondence to: Mohammad Pezeshki <mpezeshki@meta.com>. |
| Pseudocode | Yes | Algorithm 1 CROSS-RISK MINIMIZATION (XRM); an illustrative sketch of the idea appears after the table. |
| Open Source Code | Yes | Code available at https://github.com/facebookresearch/XRM. |
| Open Datasets | Yes | For sub-population shift tasks, we experiment with seven datasets and four algorithms detailed in Appendix B. We compare results with 3 sources of environment annotations: Waterbirds (Wah et al., 2011), CelebA (Liu et al., 2015), MultiNLI (Williams et al., 2017), CivilComments (Borkan et al., 2019), MetaShift (Liang & Zou, 2022), ImagenetBG (Xiao et al., 2020), and ColorMNIST (Arjovsky et al., 2019). |
| Dataset Splits | Yes | For model selection, we adhere to the standard practice of using the worst-group accuracy. We try 10 different hyper-parameter combinations detailed in Appendix B.7 with one random seed. We select the hyper-parameter combination and early-stopping iteration yielding maximal validation worst-group accuracy (or, in the absence of groups, worst-class accuracy). A sketch of this selection rule follows the table. |
| Hardware Specification | Yes | We found that, on a Volta-32GB GPU, running the Learning-to-Split group-inference module took approximately 20 hours. |
| Software Dependencies | No | The paper mentions software components such as PyTorch (implied by the code release), SGD, AdamW, ResNet-50, and BERT, but does not specify version numbers for any of them. |
| Experiment Setup | Yes | All images are resized and center-cropped to 224 × 224 pixels, and undergo no data augmentation. We use SGD with momentum 0.9 to learn from image datasets unless otherwise mentioned, and we employ AdamW (Loshchilov & Hutter, 2017) with default β1 = 0.9 and β2 = 0.999 for text benchmarks. A sketch of this setup also appears after the table. |
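
The Pseudocode row only quotes the title of Algorithm 1. As a rough illustration of the twin-network, "cross-mistake" idea that the name refers to, here is a hedged sketch: the linear twins, the toy training loop, and the omission of XRM's label-flipping step are simplifications for exposition, not the authors' reference implementation (which lives in the linked repository).

```python
# Illustrative sketch of cross-mistake environment discovery in the spirit
# of Algorithm 1 (XRM). Simplified: linear twins, plain SGD, and no
# label-flipping step; see the official repo for the real implementation.
import torch
import torch.nn.functional as F


def train_twin(model, x, y, steps=500, lr=0.1):
    """Fit one twin on its held-in half with SGD (momentum 0.9)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(model(x), y).backward()
        opt.step()


def discover_environments(x, y, num_classes):
    """Assign each training example to an environment via cross-mistakes."""
    n, d = x.shape
    # Each twin holds-in one random half of the training data.
    perm = torch.randperm(n)
    half_a, half_b = perm[: n // 2], perm[n // 2:]
    twin_a = torch.nn.Linear(d, num_classes)
    twin_b = torch.nn.Linear(d, num_classes)
    train_twin(twin_a, x[half_a], y[half_a])
    train_twin(twin_b, x[half_b], y[half_b])
    # Query each example's HELD-OUT twin: a mistake suggests the example
    # contradicts whatever shortcut the twins latched onto.
    with torch.no_grad():
        pred_a = twin_a(x).argmax(dim=1)
        pred_b = twin_b(x).argmax(dim=1)
    held_out_pred = pred_a.clone()
    held_out_pred[half_a] = pred_b[half_a]  # twin_b is held-out for half_a
    # Environment 0: held-out twin agrees with the label; environment 1: not.
    return (held_out_pred != y).long()
```

The returned annotations could then stand in for human environment labels in a downstream invariance algorithm, e.g. `envs = discover_environments(x, y, num_classes=2)`.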
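
The model-selection rule quoted in the Dataset Splits row reduces to a small amount of code. The sketch below assumes hypothetical containers: `runs`, mapping a hyper-parameter combination to per-checkpoint validation logits, plus `val_y` and `val_group` for validation labels and group annotations; none of these names come from the paper's codebase.

```python
# Minimal sketch of worst-group-accuracy model selection, assuming
# hypothetical containers for validation predictions and annotations.
import torch


def worst_group_accuracy(logits, y, group):
    """Accuracy of the worst-performing group on the validation set."""
    preds = logits.argmax(dim=1)
    accs = [(preds[group == g] == y[group == g]).float().mean()
            for g in group.unique()]
    return torch.stack(accs).min().item()


def select_model(runs, val_y, val_group):
    """Pick the (hyper-parameter combo, early-stopping step) maximizing
    validation worst-group accuracy. `runs` maps a combo id to a list of
    (step, validation logits) pairs -- a hypothetical container format."""
    best_score, best_key = -1.0, None
    for combo_id, checkpoints in runs.items():
        for step, logits in checkpoints:
            score = worst_group_accuracy(logits, val_y, val_group)
            if score > best_score:
                best_score, best_key = score, (combo_id, step)
    return best_key, best_score
```

In the absence of group annotations, the same routine applies with class labels standing in for groups, which yields the worst-class accuracy mentioned in the row.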
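
The Experiment Setup row can likewise be read as a few lines of torchvision/PyTorch configuration. In the sketch below, the resize convention before the center crop, the stand-in models (`vision_model`, `text_model`), and all learning rates are assumptions, not values taken from the paper.

```python
# Sketch of the quoted setup: 224x224 center-cropped images with no
# augmentation; SGD (momentum 0.9) for images, AdamW with default betas
# for text. Models and learning rates are placeholders.
import torch
from torchvision import transforms

image_transform = transforms.Compose([
    transforms.Resize(224),       # resize convention is an assumption
    transforms.CenterCrop(224),   # center-crop to 224 x 224, no augmentation
    transforms.ToTensor(),
])

vision_model = torch.nn.Linear(3 * 224 * 224, 2)  # stand-in for ResNet-50
text_model = torch.nn.Linear(768, 2)              # stand-in for a BERT head

sgd = torch.optim.SGD(vision_model.parameters(), lr=1e-3, momentum=0.9)
adamw = torch.optim.AdamW(text_model.parameters(), lr=1e-5,
                          betas=(0.9, 0.999))     # AdamW defaults
```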