Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

On the Byzantine-Resilience of Distillation-Based Federated Learning

Authors: Christophe Roux, Max Zimmer, Sebastian Pokutta

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate FedDistill on the CIFAR-10/100 (Krizhevsky et al., 2009), CINIC-10 (Darlow et al., 2018), and Clothing1M (Xiao et al., 2015) datasets using the ResNet (He et al., 2016) and WideResNet (Zagoruyko & Komodakis, 2016) architectures. We keep 5% of the training datasets for validation and split the remaining data evenly among the clients. Each experiment is performed with multiple random seeds and we report mean and standard deviation. Figure 1 compares how FedAVG and FedDistill are impacted by Byzantine clients when using two naive attacks. We measured the final test accuracy when varying the fraction of Byzantine clients α.
Researcher Affiliation Academia Christophe Roux, Max Zimmer & Sebastian Pokutta; Department for AI in Society, Science, and Technology, Zuse Institute Berlin, Germany; Institute of Mathematics, Technische Universität Berlin, Germany
Pseudocode Yes Algorithm 1 Federated Learning (FL) and Algorithm 2 ExpGuard
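To make the distillation-based round referenced above concrete, here is a minimal, hypothetical sketch of one server-side aggregation step: clients predict soft labels on a shared public dataset and the server averages them into distillation targets. The linear models, shapes, and the plain-mean rule are our own illustrative assumptions, not the paper's implementation; a Byzantine-robust rule such as the paper's ExpGuard would replace the mean.

```python
import numpy as np

rng = np.random.default_rng(0)

def client_predict(weights, public_x):
    # Hypothetical client model: logits from a linear map on public data.
    return public_x @ weights

def aggregate_soft_labels(all_logits):
    # Plain mean aggregation over clients; a robust weighting scheme
    # (e.g. ExpGuard-style) would replace this to resist Byzantine clients.
    return np.mean(all_logits, axis=0)

# Toy setup: 5 clients, 100 public samples, 8 features, 10 classes.
public_x = rng.normal(size=(100, 8))
client_weights = [rng.normal(size=(8, 10)) for _ in range(5)]

logits = np.stack([client_predict(w, public_x) for w in client_weights])
targets = aggregate_soft_labels(logits)  # server's distillation targets
```

The server would then fit its model to `targets` on the public data; a Byzantine client enters this sketch simply as one corrupted slice of `logits`.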
Open Source Code Yes Our code is available at github.com/ZIB-IOL/FedDistill.
Open Datasets Yes We evaluate FedDistill on the CIFAR-10/100 (Krizhevsky et al., 2009), CINIC-10 (Darlow et al., 2018), and Clothing1M (Xiao et al., 2015) datasets using the ResNet (He et al., 2016) and WideResNet (Zagoruyko & Komodakis, 2016) architectures. For CIFAR-10, we use the unlabeled split of the STL-10 dataset (Coates et al., 2011), consisting of 100k samples. For CINIC-10 we use the validation split of CINIC-10, consisting of 90k samples. For Clothing1M, we use the noisily labeled split of Clothing1M, consisting of 1M samples.
Dataset Splits Yes We keep 5% of the training datasets for validation and split the remaining data evenly among the clients.
Hardware Specification No No specific hardware details (e.g., GPU/CPU models, memory, or cloud instances) are provided in the paper. The text only refers to architectures like ResNet and WideResNet, and parameters like batch size, but not the underlying physical hardware used for computation.
Software Dependencies No No specific software dependencies with version numbers (e.g., PyTorch 1.9, Python 3.8) are listed in the paper. The text mentions training with SGD but does not specify the software framework or library versions used.
Experiment Setup Yes Specifically, we set the number of clients to 20 and the number of communication rounds to 10. We conduct ablations on these parameters in Appendix D. We set the number of total local epochs each client performs, i.e., training on their private datasets, to a fixed value depending on the dataset used. Similarly, we set a total number of communications and uniformly distribute these communications among the local epochs. We train clients and server using SGD with weight decay and a linearly decaying learning rate from 0.1 to 0, with momentum set to 0.9. Appendix B contains a detailed account of the experimental setup. (From Appendix B Table): CIFAR-10: Batch size 128, Server epochs per round 80, Total local epochs 400. CIFAR-100: Batch size 256, Server epochs per round 100, Total local epochs 250. CINIC-10: Batch size 128, Server epochs per round 80, Total local epochs 200. Clothing1M: Batch size 256, Server epochs per round 10, Total local epochs 300.
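The per-dataset hyperparameters and the learning-rate schedule quoted above can be restated as a small configuration sketch. The dict keys and the function name are our own hypothetical structuring; only the numeric values and the linear 0.1-to-0 decay come from the report.

```python
# Per-dataset hyperparameters as quoted from Appendix B of the paper.
SETUP = {
    "CIFAR-10":   {"batch_size": 128, "server_epochs_per_round": 80,  "total_local_epochs": 400},
    "CIFAR-100":  {"batch_size": 256, "server_epochs_per_round": 100, "total_local_epochs": 250},
    "CINIC-10":   {"batch_size": 128, "server_epochs_per_round": 80,  "total_local_epochs": 200},
    "Clothing1M": {"batch_size": 256, "server_epochs_per_round": 10,  "total_local_epochs": 300},
}

def linear_lr(step, total_steps, lr_max=0.1):
    # Linearly decaying learning rate from lr_max (0.1) down to 0,
    # matching the schedule described in the experiment setup.
    return lr_max * (1.0 - step / total_steps)
```

In an actual training loop, `linear_lr` would be evaluated once per optimizer step (with SGD, weight decay, and momentum 0.9 as reported).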