What Is Missing in IRM Training and Evaluation? Challenges and Solutions
Authors: Yihua Zhang, Pranay Sharma, Parikshit Ram, Mingyi Hong, Kush R. Varshney, Sijia Liu
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Lastly, we conduct extensive experiments (covering 7 existing IRM variants and 7 datasets) to justify the practical significance of revisiting IRM training and evaluation in a principled manner. |
| Researcher Affiliation | Collaboration | Yihua Zhang1, Pranay Sharma2, Parikshit Ram3, Mingyi Hong4, Kush Varshney3, Sijia Liu1,3 1Michigan State University, 2Carnegie Mellon University, 3IBM Research, 4University of Minnesota |
| Pseudocode | Yes | Algorithm A1 BLOC-IRM |
| Open Source Code | Yes | Fourth, code is available at https://github.com/OPTML-Group/BLOC-IRM. |
| Open Datasets | Yes | Our experiments are conducted over 7 datasets as referenced and shown in Tables 1, 2. Among these datasets, COLORED-MNIST, COLORED-FMNIST, CIFAR-MNIST, and COLORED-OBJECT are similarly curated, mimicking the pipeline of COLORED-MNIST (Arjovsky et al., 2019)... Furthermore, we consider three other real-world datasets, CELEBA (Liu et al., 2015), PACS (Li et al., 2017), and VLCS (Torralba & Efros, 2011), without imposing artificial spurious correlations. |
| Dataset Splits | No | The paper describes how training and test environments are generated or selected (e.g., using different bias parameters or environments for testing), but it does not specify traditional train/validation/test dataset splits with percentages or sample counts for a fixed dataset. |
| Hardware Specification | No | The paper mentions that 'The computing resources used in this work were partially supported by the MIT-IBM Watson AI Lab and the Institute for Cyber-Enabled Research (ICER) at Michigan State University,' but it does not specify any particular hardware details such as GPU models, CPU types, or memory. |
| Software Dependencies | No | The paper mentions using 'PyTorch' and the 'Adam' optimizer, but it does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | Unless specified otherwise, our training pipeline uses the small-batch training setting. By default, we use a batch size of 1024 for COLORED-MNIST and COLORED-FMNIST, and 256 for other datasets. ... For the large-batch setting, we use a penalty weight of 10^6, 190 warm-up epochs, and 500 epochs in total... For the small-batch setting, we adopt the same penalty weight of 10^6... and use 50 warm-up epochs and 200 epochs in total for all the methods. For other datasets, we adopt a batch size of 128 and use ResNet-18 as the default model architecture. We train for 200 epochs. We adopt the step-wise learning rate scheduler with an initial learning rate of 0.1. The learning rate decays by 0.1 at the 100th and 150th epochs. |
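
To make the quoted experiment setup concrete, the following is a minimal sketch of a generic IRMv1-style training loop (Arjovsky et al., 2019) wired to the hyperparameters quoted above: penalty weight 10^6 applied after the warm-up epochs, 200 epochs total, ResNet-18 backbone, Adam optimizer, and step-wise learning-rate decay by 0.1 at epochs 100 and 150. This is not the authors' BLOC-IRM algorithm or their released code; `train_loader`, `irmv1_penalty`, `num_classes=2`, and the pairing of Adam with the 0.1 step schedule are assumptions for illustration only.

```python
# Hedged sketch: a generic IRMv1-style penalty loop matching the quoted setup,
# NOT the authors' BLOC-IRM implementation (see their repository for that).
import torch
import torch.nn.functional as F
from torchvision.models import resnet18


def irmv1_penalty(logits, labels):
    """Standard IRMv1 penalty: squared gradient of the per-environment risk
    w.r.t. a fixed scalar 'dummy' classifier (Arjovsky et al., 2019)."""
    scale = torch.tensor(1.0, requires_grad=True, device=logits.device)
    loss = F.cross_entropy(logits * scale, labels)
    grad = torch.autograd.grad(loss, [scale], create_graph=True)[0]
    return (grad ** 2).sum()


# Assumed backbone and class count; the quoted setup names ResNet-18 as default.
model = resnet18(num_classes=2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)  # optimizer/LR pairing assumed
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100, 150], gamma=0.1)

WARMUP_EPOCHS, TOTAL_EPOCHS, PENALTY_WEIGHT = 50, 200, 1e6  # small-batch values from the table

for epoch in range(TOTAL_EPOCHS):
    # `train_loader` is a hypothetical loader yielding one (x, y) batch per environment.
    for env_batches in train_loader:
        risks, penalties = [], []
        for x, y in env_batches:
            logits = model(x)
            risks.append(F.cross_entropy(logits, y))
            penalties.append(irmv1_penalty(logits, y))
        # The penalty weight ramps from 1 to 10^6 once the warm-up epochs are over.
        weight = PENALTY_WEIGHT if epoch >= WARMUP_EPOCHS else 1.0
        loss = torch.stack(risks).mean() + weight * torch.stack(penalties).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```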