Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Generalizability of Adversarial Robustness Under Distribution Shifts
Authors: Kumail Alhamoud, Hasan Abed Al Kader Hammoud, Motasem Alfarra, Bernard Ghanem
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our work examines the interplay between domain generalization and adversarial robustness through comprehensive experiments on five standard DG benchmarks provided by DomainBed (Gulrajani & Lopez-Paz, 2021) and WILDS (Koh et al., 2021). We investigate empirical and certified robustness against input perturbations and spatial deformations. |
| Researcher Affiliation | Academia | Kumail Alhamoud (EMAIL), King Abdullah University of Science and Technology (KAUST); Hasan Abed Al Kader Hammoud (EMAIL), KAUST; Motasem Alfarra (EMAIL), KAUST; Bernard Ghanem (EMAIL), KAUST |
| Pseudocode | No | The paper describes methods using mathematical equations and textual explanations (e.g., Sections 3, 4, 5) but does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper references external benchmarks like DomainBed and WILDS, but does not explicitly state that the authors' own implementation code or specific methodology is made open-source or provide a link to a code repository. |
| Open Datasets | Yes | Our work examines the interplay between domain generalization and adversarial robustness through comprehensive experiments on five standard DG benchmarks provided by DomainBed (Gulrajani & Lopez-Paz, 2021) and WILDS (Koh et al., 2021). We study robustness under a variety of datasets: PACS, OfficeHome, VLCS, and Terra Incognita (Gulrajani & Lopez-Paz, 2021). To split the data into source and target domains, we use the Photo, Art, Cartoon, and Sketch distributions from PACS (Li et al., 2017). We use the DG dataset WILDS Camelyon17 (Bándi et al., 2019; Koh et al., 2021). |
| Dataset Splits | Yes | For each considered dataset, we select a subset of N−1 domains to be the source (training) domains and keep the N-th domain as the target (evaluation) domain. We follow DomainBed in reporting the average result across all N different source vs. target splits. Furthermore, we run each experiment with 3 different seeds and report the standard deviation across our runs. Note that we split the source domains (training set) into two subsets: a training subset (80%) and a validation subset (20%). |
| Hardware Specification | No | The paper discusses model architectures (ResNet-50, ViT-Base) but does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments. |
| Software Dependencies | No | The paper mentions frameworks and methods like DomainBed, WILDS, AutoAttack, and PGD, but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, CUDA versions). |
| Experiment Setup | Yes | We train PGD models with adversarial augmentation to minimize the objective in Eq. 5 on the source domains, where λ = 0.5 and x_adv is computed with a Projected Gradient Descent (PGD) attack (Madry et al., 2018) using 5 PGD steps. The TRADES models are trained to minimize the objective in Eq. 6, where β = 3. For all the models, the target domain remains unseen until test time. In the main paper, we report ℓ∞ results using ε = 2/255. For PGD, we conducted the evaluation with 20 steps. We employ Monte Carlo sampling with 100k samples and a probability of failure of 10⁻³ to estimate p_A and bound p_B = 1 − p_A, following the standard practice (Zhai et al., 2020; Cohen et al., 2019; Alfarra et al., 2022a). |
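The leave-one-domain-out protocol and the 80/20 source split described in the "Dataset Splits" row can be sketched in a few lines. This is a minimal illustration, not the authors' code; the function names are hypothetical:

```python
import random

def leave_one_out_splits(domains):
    """Yield (source_domains, target_domain) pairs: each of the N
    domains is held out once as the unseen target, matching the
    N source-vs-target splits averaged over in DomainBed."""
    for i, target in enumerate(domains):
        yield domains[:i] + domains[i + 1:], target

def train_val_split(examples, val_frac=0.2, seed=0):
    """Shuffle the pooled source examples and split them into a
    training subset (80%) and a validation subset (20%)."""
    rng = random.Random(seed)
    idx = list(range(len(examples)))
    rng.shuffle(idx)
    cut = int(len(idx) * (1 - val_frac))
    return ([examples[i] for i in idx[:cut]],
            [examples[i] for i in idx[cut:]])
```

For PACS, `leave_one_out_splits(["Photo", "Art", "Cartoon", "Sketch"])` produces the four source/target configurations; the paper additionally repeats each with 3 seeds.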
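The 5-step ℓ∞ PGD attack used for adversarial augmentation (and its 20-step variant used at evaluation) can be illustrated on a toy model. The sketch below uses a binary logistic classifier with a hand-derived gradient and a common step-size heuristic; the model and step size are stand-ins, not the paper's ResNet-50 setup:

```python
import numpy as np

def bce_loss(w, b, x, y):
    """Binary cross-entropy of a logistic model at input x."""
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def pgd_attack(w, b, x, y, eps=2 / 255, steps=5):
    """L-inf PGD (Madry et al., 2018): take signed gradient-ascent
    steps on the loss, projecting back into the eps-ball around x
    after each step."""
    alpha = 2.5 * eps / steps                      # step-size heuristic
    x_adv = x.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x_adv @ w + b)))
        grad = (p - y) * w                         # d(BCE)/dx for a logistic model
        x_adv = x_adv + alpha * np.sign(grad)      # ascent step
        x_adv = x + np.clip(x_adv - x, -eps, eps)  # project to eps-ball
    return x_adv
```

Training with λ = 0.5 would then combine the losses on clean and adversarial inputs; evaluation reuses the same loop with 20 steps and ε = 2/255.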
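The Monte Carlo certification step estimates p_A by classifying many noisy copies of the input. The paper follows Cohen et al. (2019), who use a Clopper-Pearson confidence bound; the sketch below substitutes a simpler Hoeffding lower bound to stay dependency-free, so the exact bound differs from the cited practice, and the function name and defaults are illustrative:

```python
import numpy as np

def estimate_pA_lower(classify, x, sigma=0.25, n=100_000, alpha=1e-3, seed=0):
    """Monte Carlo lower bound on p_A, the probability that the base
    classifier returns the top class under Gaussian input noise.
    The Hoeffding bound holds with probability >= 1 - alpha."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=(n,) + x.shape)
    preds = np.array([classify(x + eps) for eps in noise])
    top = np.bincount(preds).argmax()              # empirical top class
    p_hat = np.mean(preds == top)
    margin = np.sqrt(np.log(1.0 / alpha) / (2.0 * n))
    return top, max(0.0, p_hat - margin)
```

With the lower bound on p_A and the complement bound p_B = 1 − p_A, the certified ℓ₂ radius in Cohen et al. is (σ/2)(Φ⁻¹(p_A) − Φ⁻¹(p_B)).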