Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Towards Understanding Variants of Invariant Risk Minimization through the Lens of Calibration
Authors: Kotaro Yoshida, Hiroki Naganuma
TMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through a comparative analysis of datasets with distributional shifts, we observe that Information Bottleneck-based IRM achieves consistent calibration across different environments. This observation suggests that information compression techniques, such as the Information Bottleneck, are potentially effective in achieving model invariance. Furthermore, our empirical evidence indicates that models exhibiting consistent calibration across environments are also well-calibrated. This demonstrates that invariance and cross-environment calibration are empirically equivalent. |
| Researcher Affiliation | Academia | Kotaro Yoshida¹, Hiroki Naganuma²·³ (¹Tokyo Institute of Technology, ²Mila Quebec AI Institute, ³Université de Montréal) |
| Pseudocode | No | The paper describes methods and optimizations using mathematical formulations and descriptive text, but it does not contain any clearly labeled pseudocode blocks or algorithms formatted like code. |
| Open Source Code | Yes | Our code is available at https://github.com/katoro8989/IRM_Variants_Calibration |
| Open Datasets | Yes | Specifically, the datasets used were Colored MNIST (CMNIST) (Arjovsky et al., 2020), Rotated MNIST (RMNIST) (Ghifary et al., 2015), PACS (Li et al., 2017), and VLCS (Fang et al., 2013), sourced from the Domain Bed benchmark (Gulrajani & Lopez-Paz, 2020). |
| Dataset Splits | Yes | We split each dataset into training and validation sets, with 80% used for training and the remaining 20% for validation. The environment partitions were as follows: CMNIST: E_train = {10%, 20%}, E_test = {90%}; RMNIST: E_train = {15°, 30°, 45°, 60°, 75°}, E_test = {0°}; PACS: E_train = {Photo, Painting, Sketch}, E_test = {Art}; VLCS: E_train = {Caltech101, LabelMe, SUN09}, E_test = {VOC2007}. |
| Hardware Specification | Yes | We acknowledge the generous allocation of computational resources from the TSUBAME3.0 supercomputer facilitated by the Tokyo Institute of Technology. |
| Software Dependencies | No | For optimization, Adam (Kingma & Ba, 2015) was used consistently across all models, and the tuning of learning rates and hyperparameters for each method was conducted in accordance with their respective papers. No specific version numbers for Adam or other software dependencies are provided. |
| Experiment Setup | Yes | In the experiments, we set the batch size to 256 for CMNIST, 128 for RMNIST, and 16 for PACS and VLCS. Grid search was performed on the learning rate for all experiments, with values of [1e-4, 5e-4, 1e-3, 5e-3]. For the hyperparameters specific to each approximation method, grid search was conducted as shown in Table 3. |
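The 80%/20% train/validation split reported above can be sketched as a simple shuffled partition. This is an illustrative reconstruction, not the authors' released code; the function name `split_dataset` and the fixed seed are assumptions for the example.

```python
import random

def split_dataset(samples, train_fraction=0.8, seed=0):
    """Shuffle indices with a fixed seed, then split into train/validation.

    Illustrative sketch of an 80/20 split; not taken from the paper's code.
    """
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    cut = int(len(samples) * train_fraction)
    train = [samples[i] for i in indices[:cut]]
    val = [samples[i] for i in indices[cut:]]
    return train, val

# A dataset of 100 items yields 80 training and 20 validation samples.
train, val = split_dataset(list(range(100)))
```

Note that the split here is over examples within each training environment; the environment partitions listed under "Dataset Splits" are fixed separately.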
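The reported setup (per-dataset batch sizes and a learning-rate grid of [1e-4, 5e-4, 1e-3, 5e-3]) amounts to a small grid search. A minimal sketch follows, assuming a caller-supplied training function that returns a validation score; `grid_search` and `train_fn` are hypothetical names, not identifiers from the paper's repository.

```python
# Batch sizes and learning-rate grid as reported in the experiment setup.
BATCH_SIZES = {"CMNIST": 256, "RMNIST": 128, "PACS": 16, "VLCS": 16}
LEARNING_RATES = [1e-4, 5e-4, 1e-3, 5e-3]

def grid_search(dataset, train_fn):
    """Train once per learning rate; keep the best validation score.

    `train_fn(dataset, lr, batch_size)` is a stand-in for a full training
    run with Adam, as described in the report.
    """
    best_score, best_lr = None, None
    for lr in LEARNING_RATES:
        score = train_fn(dataset, lr=lr, batch_size=BATCH_SIZES[dataset])
        if best_score is None or score > best_score:
            best_score, best_lr = score, lr
    return best_score, best_lr

# Dummy training function for illustration: peaks at lr = 1e-3.
score, lr = grid_search("CMNIST", lambda d, lr, batch_size: -abs(lr - 1e-3))
```

The method-specific hyperparameters mentioned in the report (Table 3 of the paper) would extend this grid with additional axes.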