Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
An Analysis of Model Robustness across Concurrent Distribution Shifts
Authors: Myeongho Jeon, Suhwan Choi, Hyoje Lee, Teresa Yeo
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluated 26 algorithms that range from simple heuristic augmentations to zero-shot inference using foundation models, across 168 source-target pairs from eight datasets. Our analysis of over 100K models reveals that (i) concurrent DSs typically worsen performance compared to a single shift, with certain exceptions, (ii) if a model improves generalization for one distribution shift, it tends to be effective for others, and (iii) heuristic data augmentations achieve the best overall performance on both synthetic and real-world datasets. |
| Researcher Affiliation | Collaboration | Myeongho Jeon (EMAIL), École Polytechnique Fédérale de Lausanne; Suhwan Choi (EMAIL), Seoul National University / CRABs.ai; Hyoje Lee (EMAIL), Samsung Research; Teresa Yeo (EMAIL), Singapore-MIT Alliance for Research and Technology |
| Pseudocode | No | The paper describes algorithms and methods but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: https://github.com/schoi828/robustness |
| Open Datasets | Yes | Controlled datasets: We assess algorithms using five evaluation datasets: dSprites, Shapes3D, SmallNORB, CelebA, and DeepFashion. From these, we select four attributes: one is designated as the label (yl), and the other three as attributes (yα, yβ, yγ) to create DSs. Table 2 lists the attribute instances for yl and {yα, yβ, yγ} for the different controlled datasets. Uncontrolled real-world datasets: We use iWildCam, fMoW, and Camelyon17 for evaluation. |
| Dataset Splits | Yes | We provide detailed counts for each split within the datasets. To create the DSs as outlined in Section 3, variations in dataset sizes across the DSs result from the limited availability of certain attribute combinations. Table 6: Dataset size. Please note that we have included 1% counterexamples for SC in our input. ... We allocated 20% of the training set as the validation set for parameter tuning, ensuring that both share the same distribution. ... SmallNORB: We used the original train-test split provided by SmallNORB... |
| Hardware Specification | Yes | All our experiments were performed using 8 NVIDIA H100 80GB HBM3 GPUs and Intel(R) Xeon(R) Gold 6448Y. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers for libraries or programming languages. |
| Experiment Setup | Yes | To generate all the results reported in this script, we fine-tuned the hyperparameters. For the controlled datasets, we adopt early stopping, where training stops early once the patience limit is reached. Validation accuracy is measured every 100 iterations, and one unit of patience is consumed if the best validation accuracy does not improve. The specific values are detailed in Table 8 and Table 9. We conducted a grid search with these parameters to optimize results for each algorithm across all DSs and datasets. |
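The early-stopping scheme quoted above (validation every 100 iterations, one unit of patience consumed per non-improving check) can be sketched as below. This is a minimal illustration, not the authors' code; the function names, the `max_iters` cap, and the choice to reset patience on improvement are assumptions.

```python
EVAL_EVERY = 100  # validation accuracy is measured every 100 iterations


def train_with_patience(train_step, validate, max_iters, patience_limit):
    """Train until `patience_limit` validation checks pass without improvement.

    `train_step(it)` runs one training iteration; `validate()` returns the
    current validation accuracy. Both are hypothetical callables standing in
    for the real training loop.
    """
    best_acc = float("-inf")
    patience_used = 0
    it = 0
    for it in range(1, max_iters + 1):
        train_step(it)
        if it % EVAL_EVERY == 0:
            acc = validate()
            if acc > best_acc:
                best_acc = acc
                patience_used = 0  # assumption: improvement resets patience
            else:
                patience_used += 1  # one unit of patience consumed
                if patience_used >= patience_limit:
                    break  # patience limit reached: stop early
    return best_acc, it
```

Under this sketch, a grid search (as the paper describes) would simply call `train_with_patience` once per hyperparameter combination and keep the configuration with the best returned validation accuracy.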