Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Set Valued Predictions For Robust Domain Generalization

Authors: Ron Tsibulsky, Daniel Nevo, Uri Shalit

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate our approach on several real-world datasets from the WILDS benchmark, demonstrating its potential as a promising direction for robust domain generalization. Finally, in Section 5 we demonstrate the effectiveness of our proposed methods using real-world datasets from the WILDS benchmark (Koh et al., 2021). 5. Experiments We conduct several experiments to evaluate the performance of five different approaches to the problem of DG.
Researcher Affiliation	Academia	(1) Department of Computer Science, Tel Aviv University, Tel Aviv, Israel; (2) Department of Statistics and Operations Research, Tel Aviv University, Tel Aviv, Israel; (3) Department of Data and Decisions Science, Technion, Haifa, Israel. Correspondence to: Ron Tsibulsky <EMAIL>.
Pseudocode	Yes	Algorithm 1 SET-COVER: Initialize θ, C; for i from 1 to NUM_EPOCHS do; for b in batches do; call COMPUTE L_y(θ, C); L(θ, C) = Σ_{y∈Y} L_y(θ, C); perform GD step for θ with respect to L(θ, C); if i % C_UPDATE_FREQUENCY == 0 then call UPDATE C; end for; call UPDATE C; end for. Subroutine COMPUTE L_y(θ, C): L_y(θ, C) = Σ_{i∈b} 1[Y_i ≠ y] max{0, 1 + h^θ_y(X_i)} + 1[Y_i = y] C_{e_i,y} max{0, 1 − h^θ_y(X_i)}. Subroutine UPDATE C: for e ∈ E_train do; for y ∈ Y do; coverage_{e,y} ← (1/|G_{e,y}|) Σ_{i∈G_{e,y}} 1[h^θ_y(X_i) > 0]; ν ← 1 − (coverage_{e,y} − (1 − γ)); s ← 2 if ν > 1 else 1; C_{e,y} ← C_{e,y} · s · ν; end for; end for.
Open Source Code	Yes	Reproducibility Code: We release all code and evaluation scripts at https://github.com/ront65/set-valued-ood to facilitate reproducibility.
Open Datasets	Yes	We evaluate set-valued models on benchmarks from the WILDS (Koh et al., 2021) suite of benchmarks, which is designed to test models against real-world distribution shifts across various datasets and modalities. The benchmarks we use include: Camelyon (Bandi et al., 2018): This dataset consists of pathological scans from 43 patients across 5 hospitals. ... FMoW (Christie et al., 2018): A satellite image dataset ... iWildCam (Beery et al., 2020): This dataset includes images of animals in the wild ... Amazon (Ni et al., 2019): A dataset of textual reviews...
Dataset Splits	Yes	All training sets are composed of randomly sampled instances from randomly sampled domains, and for all data sets the methods are evaluated on unseen test domains. Further experimental details are available at Appendix E.1. ... Table 2. hyper-parameters used for our experiments ... Number of train domains ... Number of test domains ... Max train domain size ... Max test domain size
Hardware Specification	No	We report the average training times (measured on a single NVIDIA GPU) for ERM and SET-COVER on each dataset in Table 9. SET-COVER incurs a moderate increase in training time (approximately 30% on average) compared to ERM, primarily due to the optimization of Lagrange multipliers (C in our algorithm). Aside from this, SET-COVER shares similar computational requirements with ERM, relying on hinge-loss-based optimization without substantial architectural complexity. The paper mentions using "a single NVIDIA GPU" but does not provide specific model numbers or other detailed specifications for the hardware used in experiments.
Software Dependencies	No	We use the DomainBed (Gulrajani & Lopez-Paz, 2020) package to train these models. The paper mentions the "DomainBed package" but does not provide specific version numbers for it or any other software libraries or dependencies.
Experiment Setup	Yes	E.1. Experiments Hyper-Parameters. Table 2. Hyper-parameters used for our experiments (values listed in the order Camelyon / Fmow / Iwildcam / Amazon / Synthetic). Batch Size: 128 / 64 / 64 / 128 / 128. Learning Rate: 0.001 / 0.001 / 0.001 / 0.001 / 0.001. Number of Epochs: 5 / 5 / 5 / 5 / 30. Number of train domains: 20 / 20 / 80 / 500 / 25. Number of test domains: 20 / 18 / 40 / 100 / 25. Max train domain size: 6,000 / 4,500 / 3,000 / 1,000 / 2,000. Max test domain size: 2,000 / 3,000 / 1,000 / 1,000 / 1,000. Relevant for SET-COVER: Initial C value: 5 / 5 / 5 / 5 / 5. Frequency of C values update: 500 / 500 / 500 / 500 / 500.
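
Illustrative sketch (not taken from the paper or its repository): the two subroutines quoted in the Pseudocode row can be approximated in NumPy as below. Several details are assumptions made to read the extracted pseudocode: that the first hinge term applies to samples with Y_i ≠ y, that the C update is multiplicative (C ← C · s · ν), and that the update loops over both domains and classes. The function names, array layouts, and the gamma value used in the demo are hypothetical; the initial C value of 5 follows Table 2.

```python
import numpy as np


def set_cover_loss(scores, labels, envs, C):
    """Weighted hinge loss for one batch, summed over all classes y.

    scores: (n, k) float array, scores[i, y] plays the role of h_y(X_i)
    labels: (n,) int array of true classes Y_i
    envs:   (n,) int array of training-domain indices e_i
    C:      (num_envs, k) float array of per-domain, per-class weights
    """
    k = scores.shape[1]
    is_true_class = np.eye(k, dtype=bool)[labels]   # (n, k) one-hot mask
    exclude_term = np.maximum(0.0, 1.0 + scores)     # used when Y_i != y (assumed)
    include_term = np.maximum(0.0, 1.0 - scores)     # used when Y_i == y
    weights = C[envs]                                # (n, k), C[e_i, y]
    per_entry = np.where(is_true_class, weights * include_term, exclude_term)
    return float(per_entry.sum())


def update_C(scores, labels, envs, C, gamma):
    """Rescale each C[e, y] from the coverage of class y in domain e.

    Assumes the multiplicative reading C <- C * s * nu of the pseudocode.
    Returns a new array; the input C is left untouched.
    """
    C = C.copy()
    num_envs, k = C.shape
    for e in range(num_envs):
        for y in range(k):
            group = (envs == e) & (labels == y)      # G_{e,y}
            if not group.any():
                continue                             # no samples: keep C[e, y]
            coverage = np.mean(scores[group, y] > 0)
            nu = 1.0 - (coverage - (1.0 - gamma))
            s = 2.0 if nu > 1.0 else 1.0
            C[e, y] *= s * nu
    return C


if __name__ == "__main__":
    # Toy batch: two samples of class 0 from a single domain.
    scores = np.array([[2.0, -2.0], [-0.5, 0.5]])
    labels = np.array([0, 0])
    envs = np.array([0, 0])
    C = np.full((1, 2), 5.0)                         # initial C value from Table 2
    print(set_cover_loss(scores, labels, envs, C))
    print(update_C(scores, labels, envs, C, gamma=0.1))
```

Under this reading, a domain-class pair whose coverage falls below the target 1 − γ gets ν > 1, so its weight C grows and the true class is pushed harder into the prediction set on later steps; pairs above the target have their weight shrunk.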