Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Unified Breakdown Analysis for Byzantine Robust Gossip
Authors: Renaud Gaucher, Aymeric Dieuleveut, Hadrien Hendrikx
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We give experimental evidence to validate the effectiveness of CS+ RG and highlight the gap with NNA, in particular against a novel attack tailored to decentralized communications. ... Section 6. Experimental evaluation. We follow Farhadkhani et al. (2023) (on which the core of our code is based), and present results for classification tasks on MNIST and CIFAR-10 datasets, as well as plain averaging tasks. ... In Figure 1, it appears that the Sp H attack is more efficient in disrupting Clipped Gossip, GTS RG and IOS than Dissensus and ALIE, and that CS+ RG is highly resilient in the setup considered. |
| Researcher Affiliation | Academia | 1Centre de mathématiques appliquées, École polytechnique, Institut Polytechnique de Paris, Palaiseau, France; 2Centre Inria de l'Univ. Grenoble Alpes, CNRS, LJK, Grenoble, France. Correspondence to: Renaud Gaucher <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Byzantine-Resilient D-SGD with F RG |
| Open Source Code | Yes | See Appendix B for a detailed experimental setup and our implementation available at https://github.com/renaudgaucher/Byzantine-Robust-Gossip. |
| Open Datasets | Yes | We follow Farhadkhani et al. (2023) (on which the core of our code is based), and present results for classification tasks on MNIST and CIFAR-10 datasets, as well as plain averaging tasks. |
| Dataset Splits | No | The paper uses well-known datasets (MNIST, CIFAR-10) but does not explicitly state the training/test/validation splits used, nor does it refer to specific standard splits with citations within the text. It describes data heterogeneity and preprocessing, but not the partitioning into subsets for training, validation, or testing. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running its experiments. It describes the experimental setup in terms of software and dataset usage but omits hardware specifications. |
| Software Dependencies | No | The paper states, 'Our experimental setting is built on top of the code provided by Farhadkhani et al. (2023),' indicating a dependency. However, it does not provide specific version numbers for any software libraries, frameworks, or programming languages used (e.g., Python 3.x, PyTorch 1.x, CUDA). |
| Experiment Setup | Yes | The architecture of the model used and the experimental setup are proposed in Table 1. Table 1 (detailed experimental setting), MNIST / CIFAR-10: Model type CNN / CNN; Batch size 64 / 64; Learning rate η_op = 0.1 / η_op = 0.5; Momentum β = 0.9 / β = 0.99; Number of iterations T = 300 / T = 5000. |
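The hyperparameters reported in Table 1 can be transcribed as a small configuration sketch. The values come from the paper; the dictionary layout and the name `EXPERIMENT_SETUP` are illustrative assumptions, not the authors' code.

```python
# Hyperparameters from Table 1 of the paper, one entry per dataset.
# Layout and naming are illustrative, not taken from the released code.
EXPERIMENT_SETUP = {
    "MNIST": {
        "model": "CNN",
        "batch_size": 64,
        "learning_rate": 0.1,   # η_op
        "momentum": 0.9,        # β
        "iterations": 300,      # T
    },
    "CIFAR-10": {
        "model": "CNN",
        "batch_size": 64,
        "learning_rate": 0.5,   # η_op
        "momentum": 0.99,       # β
        "iterations": 5000,     # T
    },
}
```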
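To illustrate the kind of algorithm behind the "Pseudocode" row (Byzantine-resilient D-SGD with a robust gossip step), here is a minimal toy sketch on scalar parameters over a complete graph. It is not the paper's F RG rule: the clipping-based aggregation (in the style of ClippedGossip), the threshold `tau`, the attack value, and all function names are assumptions made for illustration.

```python
# Toy Byzantine-resilient D-SGD sketch (NOT the paper's exact rule):
# each honest node takes a local gradient step, then adds the average of
# its neighbors' deviations, clipped in magnitude to `tau` so that a
# Byzantine node cannot drag it arbitrarily far.

def clip(delta, tau):
    """Scale `delta` so its magnitude is at most `tau`."""
    m = abs(delta)
    return delta if m <= tau else delta * tau / m

def robust_gossip_step(x, grads, lr, tau, byzantine, attack_value):
    """One round of D-SGD with clipped gossip on a complete graph."""
    # Byzantine nodes broadcast an arbitrary value instead of their state.
    sent = [attack_value if i in byzantine else xi for i, xi in enumerate(x)]
    new_x = []
    for i, xi in enumerate(x):
        if i in byzantine:
            new_x.append(xi)  # attacker's internal state is irrelevant
            continue
        yi = xi - lr * grads[i](xi)  # local gradient step
        # Average the clipped deviations toward each neighbor's message.
        deltas = [clip(sj - yi, tau) for j, sj in enumerate(sent) if j != i]
        new_x.append(yi + sum(deltas) / len(deltas))
    return new_x

# Toy run: honest nodes i minimize (x - t_i)^2 / 2; node 3 is Byzantine
# and broadcasts a large constant every round.
targets = [1.0, 2.0, 3.0]
grads = [lambda x, t=t: (x - t) for t in targets] + [None]
x = [0.0, 0.0, 0.0, 0.0]
for _ in range(200):
    x = robust_gossip_step(x, grads, lr=0.1, tau=0.5, byzantine={3},
                           attack_value=100.0)
# Honest nodes settle near their targets; unclipped averaging would
# instead be dragged toward the attacker's value of 100.
```

The clipping threshold trades off robustness against consensus speed: a smaller `tau` bounds the attacker's per-round influence more tightly but also slows agreement among honest nodes.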