reproducibilityindex.ai

What Is Missing in IRM Training and Evaluation? Challenges and Solutions

Authors: Yihua Zhang, Pranay Sharma, Parikshit Ram, Mingyi Hong, Kush R. Varshney, Sijia Liu

ICLR 2023

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Lastly, we conduct extensive experiments (covering 7 existing IRM variants and 7 datasets) to justify the practical significance of revisiting IRM training and evaluation in a principled manner.
Researcher Affiliation	Collaboration	Yihua Zhang (1), Pranay Sharma (2), Parikshit Ram (3), Mingyi Hong (4), Kush Varshney (3), Sijia Liu (1, 3). (1) Michigan State University, (2) Carnegie Mellon University, (3) IBM Research, (4) University of Minnesota
Pseudocode	Yes	Algorithm A1 BLOC-IRM
Open Source Code	Yes	Fourth, codes are available at https://github.com/OPTML-Group/BLOC-IRM.
Open Datasets	Yes	Our experiments are conducted over 7 datasets as referenced and shown in Tables 1, 2. Among these datasets, COLORED-MNIST, COLORED-FMNIST, CIFAR-MNIST, and COLORED-OBJECT are similarly curated, mimicking the pipeline of COLORED-MNIST (Arjovsky et al., 2019)... Furthermore, we consider other three real-world datasets CELEBA (Liu et al., 2015), PACS (Li et al., 2017) and VLCS (Torralba & Efros, 2011), without imposing artificial spurious correlations.
Dataset Splits	No	The paper describes how training and test environments are generated or selected (e.g., using different bias parameters or environments for testing), but it does not specify traditional train/validation/test dataset splits with percentages or sample counts for a fixed dataset.
Hardware Specification	No	The paper mentions that 'The computing resources used in this work were partially supported by the MIT-IBM Watson AI Lab and the Institute for Cyber-Enabled Research (ICER) at Michigan State University,' but it does not specify any particular hardware details such as GPU models, CPU types, or memory.
Software Dependencies	No	The paper mentions using 'PyTorch' and the 'Adam' optimizer, but it does not provide specific version numbers for these or other software dependencies.
Experiment Setup	Yes	Unless specified otherwise, our training pipeline uses the small-batch training setting. By default, we use the batch size of 1024 for COLORED-MNIST and COLORED-FMNIST, and 256 for other datasets. ... For the large-batch setting, we use the penalty weight of 10^6, 190 warm-up epochs, and 500 epochs in total... For the small-batch setting, we adopt the same penalty weight 10^6... use 50 warm-up epochs and total 200 epochs for all the methods. For other datasets, we adopt the batch size of 128 and use ResNet-18 as the default model architecture. We train for 200 epochs. We adopt the step-wise learning rate scheduler with an initial learning rate of 0.1. The learning rate decays by 0.1 at the 100th and 150th epochs.
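
The step-wise learning rate schedule quoted in the Experiment Setup row can be illustrated with a short, dependency-free Python sketch. This is not the authors' code: the function name `step_lr`, the constant `TOTAL_EPOCHS`, and the 0-indexed epoch convention are assumptions made for illustration, and "decays by 0.1" is read here as multiplying the rate by 0.1 at each milestone. Under that reading the behaviour matches what PyTorch's `torch.optim.lr_scheduler.MultiStepLR` does with `milestones=[100, 150]` and `gamma=0.1`.

```python
from bisect import bisect_right

# Values taken from the quoted setup; the names are hypothetical.
TOTAL_EPOCHS = 200
BASE_LR = 0.1
MILESTONES = (100, 150)  # epochs at which the rate is decayed
GAMMA = 0.1              # multiplicative decay factor (assumed reading of "decays by 0.1")


def step_lr(epoch, base_lr=BASE_LR, milestones=MILESTONES, gamma=GAMMA):
    """Return the learning rate in effect during `epoch` (0-indexed, an assumption)."""
    # Count how many milestones have been reached by this epoch.
    n_decays = bisect_right(milestones, epoch)
    return base_lr * gamma ** n_decays


if __name__ == "__main__":
    schedule = [step_lr(e) for e in range(TOTAL_EPOCHS)]
    # Print the rate just before and at each milestone, plus the final epoch.
    for e in (0, 99, 100, 149, 150, TOTAL_EPOCHS - 1):
        print(f"epoch {e:3d}: lr = {schedule[e]:.4g}")
```

Whether the paper counts epochs from 0 or from 1 is not stated in the quoted text, so the exact epoch at which each decay takes effect may differ by one from this sketch.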