Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Personalized Federated Learning with Spurious Features: An Adversarial Approach
Authors: Xiaoyang Wang, Han Zhao, Klara Nahrstedt, Sanmi Koyejo
TMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results on object and action recognition tasks show that our proposed approach bounds personalized models from further exploiting spurious features while preserving the benefit of enhanced accuracy from fine-tuning. We conduct extensive experiments to validate the effectiveness of the proposed methods under FL settings. Our experiments on MNIST (Deng, 2012), Coil20 (Nene et al., 1996), CelebA (Liu et al., 2015; Caldas et al., 2018), and biased action recognition (BAR) (Nam et al., 2020) datasets show that the proposed approach reduces the accuracy disparity of personalized models from 18.38% to 3.42%. Our method also preserves the benefit of the enhanced average accuracy from fine-tuning, resulting in 4.48% accuracy improvement in the global environment. |
| Researcher Affiliation | Academia | Xiaoyang Wang, Department of Computer Science, University of Illinois at Urbana-Champaign; Han Zhao, Department of Computer Science, University of Illinois at Urbana-Champaign; Klara Nahrstedt, Department of Computer Science, University of Illinois at Urbana-Champaign; Sanmi Koyejo, Department of Computer Science, Stanford University |
| Pseudocode | No | The paper describes methods and processes in narrative text and mathematical equations, but does not include any explicitly labeled pseudocode or algorithm blocks with structured, step-by-step instructions. |
| Open Source Code | No | The paper states it was "Reviewed on OpenReview: https://openreview.net/forum?id=N2wx9UVHkH", but this link points to a review forum, not a source-code repository. The paper contains no explicit statement about the release of source code for the described methodology, nor a direct link to a code repository. |
| Open Datasets | Yes | We conduct extensive experiments to validate the effectiveness of the proposed methods under FL settings. Our experiments on MNIST (Deng, 2012), Coil20 (Nene et al., 1996), CelebA (Liu et al., 2015; Caldas et al., 2018), and biased action recognition (BAR) (Nam et al., 2020) datasets show that the proposed approach reduces the accuracy disparity of personalized models from 18.38% to 3.42%. |
| Dataset Splits | Yes | Local datasets are further partitioned to train/validation/test set with a ratio of 72:8:20, following prior work (Li et al., 2021). |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used for running its experiments (e.g., GPU models, CPU types, or other accelerator specifications). |
| Software Dependencies | No | We use Adam optimizer (Kingma & Ba, 2015) throughout our experiments with a learning rate of 1e-4 for MNIST, CelebA, and BAR and 2e-4 for Coil20. While an optimizer is mentioned, specific software dependencies such as programming-language versions or machine learning framework versions (e.g., PyTorch, TensorFlow) are not provided. |
| Experiment Setup | Yes | We use Adam optimizer (Kingma & Ba, 2015) throughout our experiments with a learning rate of 1e-4 for MNIST, CelebA, and BAR and 2e-4 for Coil20. ... We train the global model for 500 rounds. 5 clients are selected per round, each performing 5 epochs of local updates. We tune the coefficients of the adversarial transferability and L2 regularization terms from {0.01, 0.1, 1.0, 10.0} and select the largest value that does not decrease the validation accuracy during penalization. We start the attack budget at 0.031 (i.e., 8/255) and gradually decrease it such that 30%–50% of the attack succeeds. We configure ϵ to 0.031/0.01/0.031 for MNIST/CelebA/Coil20, respectively. We fine-tune the global model for 5 epochs on MNIST/BAR and 10 epochs on Coil20/CelebA, which are sufficient for the personalized models to converge. |
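Two of the reproducibility details quoted above are concrete enough to sketch in code: the 72:8:20 local train/validation/test partition (Dataset Splits row) and the rule for tuning penalty coefficients, i.e., selecting the largest value from {0.01, 0.1, 1.0, 10.0} that does not decrease validation accuracy (Experiment Setup row). The paper does not release code, so the function names below and the accuracy values in the usage example are purely illustrative:

```python
import random

def split_local_dataset(samples, seed=0):
    """Partition one client's local dataset into train/validation/test
    with the 72:8:20 ratio reported in the paper (following Li et al., 2021).
    `samples` is any list of examples; this helper is hypothetical."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * 0.72)
    n_val = int(n * 0.08)
    # The remainder (20%) becomes the test set.
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

def select_coefficient(candidates, val_acc):
    """Return the largest candidate coefficient whose validation accuracy
    is at least the unpenalized baseline (stored under key 0.0), per the
    paper's stated tuning rule. `val_acc` maps coefficient -> accuracy;
    in practice these accuracies would come from actual training runs."""
    baseline = val_acc[0.0]
    feasible = [c for c in sorted(candidates) if val_acc[c] >= baseline]
    return max(feasible) if feasible else None

# Illustrative usage with made-up accuracy numbers:
train, val, test = split_local_dataset(list(range(100)))
# -> 72 / 8 / 20 samples per split for a 100-sample local dataset
coef = select_coefficient(
    [0.01, 0.1, 1.0, 10.0],
    {0.0: 0.90, 0.01: 0.91, 0.1: 0.90, 1.0: 0.88, 10.0: 0.80},
)
# -> 0.1, the largest coefficient not dropping below the 0.90 baseline
```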