Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
What Is Missing in IRM Training and Evaluation? Challenges and Solutions
Authors: Yihua Zhang, Pranay Sharma, Parikshit Ram, Mingyi Hong, Kush R. Varshney, Sijia Liu
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Lastly, we conduct extensive experiments (covering 7 existing IRM variants and 7 datasets) to justify the practical significance of revisiting IRM training and evaluation in a principled manner. |
| Researcher Affiliation | Collaboration | Yihua Zhang1, Pranay Sharma2, Parikshit Ram3, Mingyi Hong4, Kush Varshney3, Sijia Liu1,3 1Michigan State University, 2Carnegie Mellon University, 3IBM Research, 4University of Minnesota |
| Pseudocode | Yes | Algorithm A1 BLOC-IRM |
| Open Source Code | Yes | Fourth, codes are available at https://github.com/OPTML-Group/BLOC-IRM. |
| Open Datasets | Yes | Our experiments are conducted over 7 datasets as referenced and shown in Tables 1, 2. Among these datasets, COLORED-MNIST, COLORED-FMNIST, CIFAR-MNIST, and COLORED-OBJECT are similarly curated, mimicking the pipeline of COLORED-MNIST (Arjovsky et al., 2019)... Furthermore, we consider other three real-world datasets CELEBA (Liu et al., 2015), PACS (Li et al., 2017) and VLCS (Torralba & Efros, 2011), without imposing artificial spurious correlations. |
| Dataset Splits | No | The paper describes how training and test environments are generated or selected (e.g., using different bias parameters or environments for testing), but it does not specify traditional train/validation/test dataset splits with percentages or sample counts for a fixed dataset. |
| Hardware Specification | No | The paper mentions that 'The computing resources used in this work were partially supported by the MIT-IBM Watson AI Lab and the Institute for Cyber-Enabled Research (ICER) at Michigan State University,' but it does not specify any particular hardware details such as GPU models, CPU types, or memory. |
| Software Dependencies | No | The paper mentions using 'PyTorch' and the 'Adam' optimizer, but it does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | Unless specified otherwise, our training pipeline uses the small-batch training setting. By default, we use the batch size of 1024 for COLORED-MNIST and COLORED-FMNIST, and 256 for other datasets. ... For the large-batch setting, we use the penalty weight of 10^6, 190 warm-up epochs, and 500 epochs in total... For the small-batch setting, we adopt the same penalty weight 10^6... use 50 warm-up epochs and total 200 epochs for all the methods. For other datasets, we adopt the batch size of 128 and use ResNet-18 as the default model architecture. We train for 200 epochs. We adopt the step-wise learning rate scheduler with an initial learning rate of 0.1. The learning rate decays by 0.1 at the 100th and 150th epochs. |
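
The step-wise learning rate schedule quoted in the Experiment Setup row can be sketched in plain Python as follows. This is an illustrative reading of the quoted text, not the authors' code: the class name `StepSchedule`, the method `lr_at`, and the zero-based epoch indexing (the decay takes effect once the epoch index reaches 100 and 150) are assumptions made here. The milestone convention is chosen to resemble that of PyTorch's `MultiStepLR`, which the paper may or may not use.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class StepSchedule:
    """Hypothetical helper for the quoted schedule (names are assumptions)."""

    base_lr: float = 0.1                    # initial learning rate from the quoted text
    milestones: Tuple[int, ...] = (100, 150)  # epochs at which the rate decays
    gamma: float = 0.1                      # multiplicative decay factor
    total_epochs: int = 200                 # total training epochs

    def lr_at(self, epoch: int) -> float:
        """Return the learning rate for a zero-based epoch index (assumed convention)."""
        if not 0 <= epoch < self.total_epochs:
            raise ValueError(f"epoch must be in [0, {self.total_epochs})")
        # Count how many milestones have been reached by this epoch.
        drops = sum(1 for m in self.milestones if epoch >= m)
        return self.base_lr * self.gamma ** drops


schedule = StepSchedule()
for epoch in (0, 99, 100, 149, 150, 199):
    print(epoch, schedule.lr_at(epoch))
```

Under these assumptions the rate stays at 0.1 for the first 100 epochs, drops to roughly 0.01 for the next 50, and to roughly 0.001 for the final 50. Printed values may show small floating-point rounding.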