Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
An Analysis of Model Robustness across Concurrent Distribution Shifts
Authors: Myeongho Jeon, Suhwan Choi, Hyoje Lee, Teresa Yeo
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluated 26 algorithms that range from simple heuristic augmentations to zero-shot inference using foundation models, across 168 source-target pairs from eight datasets. Our analysis of over 100K models reveals that (i) concurrent DSs typically worsen performance compared to a single shift, with certain exceptions, (ii) if a model improves generalization for one distribution shift, it tends to be effective for others, and (iii) heuristic data augmentations achieve the best overall performance on both synthetic and real-world datasets. |
| Researcher Affiliation | Collaboration | Myeongho Jeon (EMAIL), École Polytechnique Fédérale de Lausanne; Suhwan Choi (EMAIL), Seoul National University / CRABs.ai; Hyoje Lee (EMAIL), Samsung Research; Teresa Yeo (EMAIL), Singapore-MIT Alliance for Research and Technology |
| Pseudocode | No | The paper describes algorithms and methods but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: https://github.com/schoi828/robustness |
| Open Datasets | Yes | Controlled datasets: We assess algorithms using five evaluation datasets: dSprites, Shapes3D, SmallNORB, CelebA, and DeepFashion. From these, we select four attributes: one is designated as the label (yl), and the other three as attributes (yα, yβ, yγ) to create DSs. Table 2 lists the attribute instances for yl and {yα, yβ, yγ} for the different controlled datasets. Uncontrolled real-world datasets: We use iWildCam, fMoW, and Camelyon17 for evaluation. |
| Dataset Splits | Yes | We provide detailed counts for each split within the datasets. To create the DSs as outlined in Section 3, variations in dataset sizes across the DSs result from the limited availability of certain attribute combinations. Table 6: Dataset size. Please note that we have included 1% counterexamples for SC in our input. ... We allocated 20% of the training set as the validation set for parameter tuning, ensuring that both share the same distribution. ... SmallNORB: We used the original train-test split provided by SmallNORB... |
| Hardware Specification | Yes | All our experiments were performed using 8 NVIDIA H100 80GB HBM3 GPUs and Intel(R) Xeon(R) Gold 6448Y. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers for libraries or programming languages. |
| Experiment Setup | Yes | To generate all the results reported in this script, we fine-tuned the hyperparameters. For the controlled datasets, we adopt early stopping, where training stops early once the patience limit is reached. Validation accuracy is measured every 100 iterations, and one unit of patience is consumed if the best validation accuracy does not improve. The specific values are detailed in Table 8 and Table 9. We conducted a grid search with these parameters to optimize results for each algorithm across all DSs and datasets. |
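The early-stopping scheme quoted above (validation every 100 iterations, one unit of patience consumed per non-improving check) can be sketched as below. This is a minimal illustration, not the authors' code; the function names, the `max_iters` cap, and the choice to reset patience on improvement are assumptions.

```python
EVAL_EVERY = 100  # validation accuracy is measured every 100 iterations


def train_with_patience(train_step, validate, max_iters, patience_limit):
    """Train until `patience_limit` validation checks pass without improvement.

    `train_step(it)` runs one training iteration; `validate()` returns the
    current validation accuracy. Both are hypothetical callables standing in
    for the real training loop.
    """
    best_acc = float("-inf")
    patience_used = 0
    it = 0
    for it in range(1, max_iters + 1):
        train_step(it)
        if it % EVAL_EVERY == 0:
            acc = validate()
            if acc > best_acc:
                best_acc = acc
                patience_used = 0  # assumption: improvement resets patience
            else:
                patience_used += 1  # one unit of patience consumed
                if patience_used >= patience_limit:
                    break  # patience limit reached: stop early
    return best_acc, it
```

Under this sketch, a grid search (as the paper describes) would simply call `train_with_patience` once per hyperparameter combination and keep the configuration with the best returned validation accuracy.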