Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
A First Look into the Carbon Footprint of Federated Learning
Authors: Xinchi Qiu, Titouan Parcollet, Javier Fernandez-Marques, Pedro P. B. Gusmao, Yan Gao, Daniel J. Beutel, Taner Topal, Akhil Mathur, Nicholas D. Lane
JMLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive Experiments. Carbon sensitivity analysis is conducted with this method on real FL hardware under different settings, strategies, and tasks (Section 4). We demonstrate that CO2e emissions depend on a wide range of hyper-parameters and that emissions derived from communication between clients and server can represent from 0.7% up to more than 96% of total emission. When compared to centralized training, we show that for different tasks and settings, FL can emit from 72% to hundreds of times more carbon than its centralized version. |
| Researcher Affiliation | Collaboration | Xinchi Qiu, Titouan Parcollet, Javier Fernandez-Marques, Pedro P. B. Gusmao, Yan Gao, Daniel J. Beutel, Taner Topal, Akhil Mathur, Nicholas D. Lane. Affiliations: Department of Computer Science and Technology, University of Cambridge; Laboratoire Informatique d'Avignon, Avignon Université; Department of Computer Science, University of Oxford; Flower Labs GmbH; Nokia Bell Labs |
| Pseudocode | No | The paper describes methodologies using equations and text, but it does not contain a dedicated pseudocode block or algorithm section. |
| Open Source Code | No | The paper states: "We make use of the Flower framework (Beutel et al., 2020) to implement and parameterize different FL training pipelines." However, it does not explicitly state that the code specific to the experiments presented in *this* paper is open-source, nor does it provide a link to such code. The referenced Flower framework is a third-party tool. |
| Open Datasets | Yes | This article provides extensive estimates across different types of tasks and datasets, including image classification with CIFAR10 (Krizhevsky et al., 2009), FEMNIST (LeCun, 1998; Cohen et al., 2017), and ImageNet (Russakovsky et al., 2015), speech processing with keyword spotting on Speech Commands (Warden, 2018), and speech recognition with Common Voice (Ardila et al., 2020). |
| Dataset Splits | Yes | We consider a pool of 500 clients for CIFAR10, with 10 active clients training concurrently per round. We split ImageNet and Speech Commands into 100 clients and randomly select 10 clients per round. As for FEMNIST and CV Italian, there are 3597 and 649 natural clients respectively, and we select 35 and 10 clients in each communication round. For ImageNet, we chose α = 1000 for the IID dataset partition and α = 0.5 for non-IID, following Yurochkin et al. (2019) and Hsu et al. (2019). For CIFAR10, we choose α = 0.1 following the same protocol as Reddi et al. (2021). As for Speech Commands, in light of the unbalanced nature of the dataset, we propose to change the prior of the LDA from a uniform distribution to a multinomial distribution. Hence the LDA can be summarized as: (N_1, ..., N_m) ∼ Dir(αp) (7), where N_i stands for the number of data samples from class i and N stands for the total number of data samples in the dataset. According to Yurochkin et al. (2019) and Hsu et al. (2019), α is commonly set to 0.5 for a non-IID partition of a balanced dataset. Given the aforementioned unbalanced nature of the dataset, we propose to match the variance of the 10 keyword classes under the multinomial prior to their variance under a uniform prior by changing α to 1.0. In practice, a non-IID dataset can mean both class imbalance and feature imbalance among clients. Other latent factors can also vary, such as user accent or voice timbre in speech recognition, or different calligraphy styles in handwritten text. Therefore, we also include two naturally partitioned datasets, FEMNIST and CV Italian, to capture feature-imbalanced datasets. For CV Italian, we first pre-train the model on half of the data samples in a centralized fashion. We do this by partitioning the original dataset into a small subset of speakers (99) for centralized training and a larger subset of speakers (649) for the FL experiment. Then, we simulate a scenario of single speakers using their individual devices by naturally dividing the training set based on user IDs into 649 partitions. We followed the partitioning methodology in Caldas et al. (2018) to extract the FEMNIST dataset from EMNIST following a natural partitioning by writer ID. |
| Hardware Specification | Yes | Centralized training hardware. We run our experiments on a server equipped with two Xeon 6152 22-core processors and NVIDIA Tesla V100 32GB GPUs. Federated learning hardware. We consider the use of NVIDIA Tegra X2 (Smith, 2017) and Jetson Xavier NX (Smith, 2019) devices as our FL clients. |
| Software Dependencies | No | Experiments are built on top of PyTorch (Paszke et al., 2019) and SpeechBrain (Ravanelli et al., 2021). We make use of the Flower framework (Beutel et al., 2020) to implement and parameterize different FL training pipelines. Specific version numbers for these software components are not explicitly stated in the text. |
| Experiment Setup | Yes | These models are trained with SGD, but only the centralized setting makes use of momentum. For the sake of completeness, we choose to use a different deep learning model for the Speech Commands dataset. We employ 4 layers of LSTM, each with 256 nodes. The models are trained using Adam optimization. Also, the hyper-parameters, such as learning rates, are set to be the same as in centralized learning without further tuning. For CIFAR10, we follow the experimental protocol proposed in Reddi et al. (2021), considering the suggested best values for η, η_l, and τ in almost every experiment except for FedAvg, where we had to lower the value of η_l to 10^(−3/2) to allow training. All other experiments used a server learning rate η = 0.1 and τ = 0.001. Local epochs (LE). We also propose to vary the number of local epochs done on each client to better highlight the contribution of the local computations to the total emissions. To be consistent, we choose to do 1 and 5 local epochs across all tasks except the ASR task (which requires 5 local epochs to obtain acceptable performance). Target accuracies. We set the target accuracies for CIFAR10, FEMNIST and ImageNet to be 70%, 80% and 50% top-1 accuracy respectively. For Speech Commands, the threshold is set to 70%, and for CV Italian, the target is set to be 25% Word Error Rate (WER). |
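The Dirichlet (LDA) partitioning used for the non-IID splits above can be sketched in a few lines. This is a minimal illustration, not the paper's code: `lda_partition` is a hypothetical helper, and `alpha` plays the same role as α in equation (7), with smaller values producing more skewed (non-IID) label distributions per client.

```python
import numpy as np

def lda_partition(labels, num_clients, alpha, rng=None):
    """Split sample indices across clients with a Dirichlet (LDA) prior.

    For each class, a Dirichlet draw decides what fraction of that
    class's samples each client receives: small alpha -> heavy label
    skew (non-IID), large alpha -> close to an IID partition.
    """
    rng = np.random.default_rng(rng)
    labels = np.asarray(labels)
    partitions = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        # Shuffle this class's sample indices before slicing.
        idx = rng.permutation(np.flatnonzero(labels == c))
        # Per-client proportions ~ Dir(alpha * p), p uniform here.
        props = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(props) * len(idx)).astype(int)[:-1]
        for client, shard in enumerate(np.split(idx, cuts)):
            partitions[client].extend(shard.tolist())
    return partitions
```

For example, `lda_partition(labels, num_clients=100, alpha=0.5)` would mimic the non-IID ImageNet split described above, while `alpha=1000` approximates the IID case.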
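The finding that communication can account for 0.7% up to more than 96% of total emissions follows from a simple energy-accounting decomposition: total CO2e scales with client compute energy plus network transfer energy, times the grid's carbon intensity. The sketch below is a hypothetical back-of-the-envelope estimator, not the paper's methodology; the default carbon-intensity and energy-per-GB constants are illustrative placeholders, not figures taken from the paper.

```python
def co2e_grams(device_watts, train_seconds, comm_gb, num_clients,
               carbon_intensity_g_per_kwh=300.0, kwh_per_gb=0.014):
    """Rough CO2e estimate for one FL training run.

    device_watts   -- average power draw of one client device
    train_seconds  -- total on-device training time per client
    comm_gb        -- total data exchanged between clients and server
    Constants are illustrative assumptions, not the paper's values.
    """
    # Compute energy: watts * seconds -> kWh, summed over clients.
    compute_kwh = device_watts * train_seconds / 3.6e6 * num_clients
    # Network energy: a fixed energy cost per GB transferred.
    comm_kwh = comm_gb * kwh_per_gb
    return (compute_kwh + comm_kwh) * carbon_intensity_g_per_kwh
```

Varying the local-epoch count or client pool in such a model shifts the compute/communication balance, which is consistent with the wide 0.7%–96% communication share the paper reports across settings.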