Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

Streaming Federated Learning with Markovian Data

Authors: Khiem Huynh, Malcolm Egan, Giovanni Neglia, Jean-Marie Gorce

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our analysis is validated via experiments with real pollution monitoring time series data. [...] 5 Numerical Results: In this section, we evaluate the performance of Local SGD, Minibatch SGD, and Local SGD with Momentum on the Beijing Multi-Site Air-Quality dataset [7], which contains hourly measurements from 12 weather stations across China, collected between March 2013 and February 2017.
Researcher Affiliation	Academia	Tan-Khiem Huynh (Inria, INSA Lyon); Malcolm Egan (Inria, INSA Lyon); Giovanni Neglia (Inria, Université Côte d'Azur); Jean-Marie Gorce (Inria, INSA Lyon). Inria, INSA Lyon, CITI, UR3720, 69621 Villeurbanne, France
Pseudocode	Yes	Algorithm 1: Local SGD-M; Algorithm 2: Minibatch SGD; Algorithm 3: Local SGD
Open Source Code	Yes	Code is provided in the supplemental material with detailed instructions.
Open Datasets	Yes	5 Numerical Results: In this section, we evaluate the performance of Local SGD, Minibatch SGD, and Local SGD with Momentum on the Beijing Multi-Site Air-Quality dataset [7], which contains hourly measurements from 12 weather stations across China, collected between March 2013 and February 2017. Reference [7]: Song Chen. Beijing Multi-Site Air Quality. UCI Machine Learning Repository, 2017. DOI: https://doi.org/10.24432/C5RK5G.
Dataset Splits	Yes	We partition the data temporally, reserving the last 12 months for testing and using the preceding 36 months for training.
Hardware Specification	Yes	All the experiments performed in this paper are run entirely on many different CPU clusters provided by the Grid 5000 testbed, with different types of CPU (e.g., Intel Xeon E5-2698, Intel Xeon E5-2620, Intel Xeon E5-2630).
Software Dependencies	No	All the software packages and datasets used for experiments in this paper are open-sourced, with the exact version provided. For the main experiments, we use 10 different random seeds, and report the average together with the 95% confidence interval. Further detailed instructions to run the experiments are provided in the supplementary material. [...] We estimate and remove the seasonality via the STATSMODEL Python package [52].
Experiment Setup	Yes	In Figure 1, we plot the trajectories of the gradient norm over the communication rounds for Minibatch SGD, Local SGD, Local SGD-M, and SCAFFOLD [27], which is the first algorithm proposed in the i.i.d. setting to mitigate heterogeneity in FL. We compare these methods under varying numbers of samples, K, per communication round. We observe that the performance of Minibatch SGD and Local SGD-M consistently improves as K increases, whereas Local SGD and SCAFFOLD exhibit little to no improvement. This is consistent with our theoretical findings, which identify heterogeneity as a limiting factor as K increases for Local SGD, but not for Minibatch SGD or Local SGD-M. For SCAFFOLD, we argue that its advantage is clearer in the [...] Figure 1: Gradient norm as a function of the number of communication rounds for Local SGD, Minibatch SGD, Local SGD-M, and SCAFFOLD, with γ = 0.1, η = 0.01, β = 0.5, λ = 0.01 for 120 clients (each client has access to 12 consecutive months of training data) and different numbers of local steps.
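
The temporal partition quoted in the Dataset Splits row can be illustrated with a minimal sketch. This is not the authors' code: the exact cutoff (a calendar-month boundary at 1 March 2016), the names `temporal_split` and `hourly_timestamps`, and the placeholder values are assumptions made for illustration only.

```python
from datetime import datetime, timedelta

# Assumed boundaries, inferred from the quoted description: hourly data from
# March 2013 through February 2017, with the last 12 months held out.
DATA_START = datetime(2013, 3, 1)
TEST_START = datetime(2016, 3, 1)  # assumption: split on a calendar-month boundary
DATA_END = datetime(2017, 3, 1)    # exclusive upper bound


def temporal_split(records):
    """Split (timestamp, value) pairs by time, without shuffling."""
    train = [r for r in records if DATA_START <= r[0] < TEST_START]
    test = [r for r in records if TEST_START <= r[0] < DATA_END]
    return train, test


def hourly_timestamps(start, end):
    """Yield one timestamp per hour in the half-open interval [start, end)."""
    t = start
    while t < end:
        yield t
        t += timedelta(hours=1)


# Placeholder series: one dummy value per hour (hypothetical data).
records = [(t, 0.0) for t in hourly_timestamps(DATA_START, DATA_END)]
train, test = temporal_split(records)
```

Splitting by time rather than at random keeps every test observation strictly later than every training observation, which matters for temporally dependent data such as the Markovian streams the paper studies.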
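
The Software Dependencies row quotes results averaged over 10 random seeds with a 95% confidence interval. The excerpt does not say how the interval is computed; the sketch below assumes a two-sided Student-t interval on the per-seed values, and the function name `mean_ci95` and the sample numbers are hypothetical.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 9 degrees of freedom (10 seeds).
# Assumption: the interval is a t-interval; the source does not specify this.
T_CRIT_95_DF9 = 2.262


def mean_ci95(values):
    """Return (mean, half_width) of a 95% t-interval for exactly 10 runs."""
    if len(values) != 10:
        raise ValueError("the hard-coded critical value assumes 10 runs")
    mean = statistics.mean(values)
    # statistics.stdev is the sample standard deviation (n - 1 denominator).
    half_width = T_CRIT_95_DF9 * statistics.stdev(values) / math.sqrt(len(values))
    return mean, half_width


# Hypothetical per-seed metric values, one per random seed.
per_seed = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
mean, half_width = mean_ci95(per_seed)
print(f"{mean:.3f} +/- {half_width:.3f}")
```

With a different number of seeds, the critical value would need to be replaced by the one for the corresponding degrees of freedom.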