Tackling the Data Heterogeneity in Asynchronous Federated Learning with Cached Update Calibration

Authors: Yujia Wang, Yuanpu Cao, Jingcheng Wu, Ruoyu Chen, Jinghui Chen

ICLR 2024

Reproducibility assessment (each entry gives the variable, its result, and the supporting LLM response):
Research Type: Experimental
LLM Response: Through the theoretical convergence analysis of one representative asynchronous federated learning algorithm under standard nonconvex stochastic settings, we show that asynchronous delay can substantially slow down convergence, especially under high data heterogeneity. To further improve the convergence of asynchronous federated learning under heterogeneous data distributions, we propose a novel asynchronous federated learning method with cached update calibration. Specifically, we let the server cache the latest update from each client and reuse these cached updates to calibrate the global update at each round. We theoretically prove the convergence acceleration of our proposed method under nonconvex stochastic settings. Extensive experiments on several vision and language tasks demonstrate the superior performance of our method compared to other asynchronous federated learning baselines. (An illustrative sketch of this cached calibration step appears at the end of this section.)

Researcher Affiliation: Academia
LLM Response: Yujia Wang (1), Yuanpu Cao (1), Jingcheng Wu (2), Ruoyu Chen (2), Jinghui Chen (1); (1) The Pennsylvania State University, (2) Carnegie Mellon University. {yjw5427, ymc5533}@psu.edu, {jingchew, ruoyuche}@andrew.cmu.edu, jzc5917@psu.edu

Pseudocode: Yes
LLM Response: Algorithm 1 (FedBuff without DP) and Algorithm 2 (Cache-Aided Asynchronous FL).

Open Source Code: No
LLM Response: The paper does not provide a direct link to a code repository or an explicit statement about the public release of the code.

Open Datasets: Yes
LLM Response: For the vision tasks, we train a CNN (Wang & Ji, 2022) and ResNet-18 (He et al., 2016) on CIFAR-10, and ResNet-18 on CIFAR-100 (Krizhevsky et al., 2009), under various data sampling levels and client concurrency settings. For the language tasks, we fine-tune a pretrained BERT-base model (Devlin et al., 2018) on several datasets from the GLUE benchmark (Wang et al., 2018).

Dataset Splits: Yes
LLM Response: We evaluate on non-i.i.d. data distributions using a Dirichlet-based partitioning strategy similar to (Wang et al., 2020a;b), with several concentration parameters for both the vision and language tasks. (An illustrative Dirichlet partitioning sketch appears at the end of this section.)

Hardware Specification: Yes
LLM Response: All experiments in this paper are conducted on 4 NVIDIA RTX A6000 GPUs.

Software Dependencies: No
LLM Response: The paper mentions optimizers such as SGD and AdamW but does not provide version numbers for these or other software libraries (e.g., PyTorch, TensorFlow).

Experiment Setup: Yes
LLM Response: For experiments on CIFAR-10 and CIFAR-100, the number of local training iterations K on each client corresponds to two local epochs (the iteration count depends on the amount of data held by each client), and the batch size is set to 50 for all experiments by default. For local updates, we use the SGD optimizer with a learning rate grid-searched from {0.001, 0.01, 0.1, 1}, momentum 0.9, and weight decay 1e-4; the global learning rate is grid-searched from {0.1, 1.0, 2.0} for all methods. We set a total of 100 clients in the network with concurrency Mc = 20 unless otherwise specified, and the update accumulation amount M = 10 by default. (An illustrative sketch of this local SGD step appears at the end of this section.)

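To make the cached update calibration concrete, below is a minimal server-side sketch, assuming a FedBuff-style buffered asynchronous server that stores model parameters as flat NumPy vectors. The class name CachedServer and the exact calibration rule (mean of all cached updates plus a correction from the freshly received ones) are illustrative assumptions, not the authors' exact Algorithm 2.

import numpy as np


class CachedServer:
    """Buffered asynchronous server that keeps a cached update h_i for every client.
    Illustrative sketch; not the authors' implementation."""

    def __init__(self, model_dim, num_clients, global_lr=1.0, buffer_size=10):
        self.x = np.zeros(model_dim)                     # global model parameters
        self.cache = np.zeros((num_clients, model_dim))  # h_i: latest cached update of client i
        self.buffer = []                                 # (client_id, delta) pairs received so far
        self.global_lr = global_lr
        self.buffer_size = buffer_size                   # M: updates accumulated per global step

    def receive(self, client_id, delta):
        # Called whenever an asynchronous (possibly stale) client finishes local training.
        self.buffer.append((client_id, np.asarray(delta)))
        if len(self.buffer) >= self.buffer_size:
            self._global_step()

    def _global_step(self):
        # Calibrate: start from the mean of *all* cached updates, then correct it with
        # the difference between the freshly received updates and their cached versions.
        h_bar = self.cache.mean(axis=0)
        correction = np.mean(
            [delta - self.cache[cid] for cid, delta in self.buffer], axis=0
        )
        self.x += self.global_lr * (h_bar + correction)
        # Refresh the cache with the new updates and clear the buffer for the next round.
        for cid, delta in self.buffer:
            self.cache[cid] = delta
        self.buffer = []

A client that finishes local training simply calls server.receive(client_id, delta); the server takes a calibrated global step every buffer_size arrivals, so clients that did not report this round still contribute through their cached updates.
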
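The Dirichlet-based non-i.i.d. splits can be produced with the standard label-partitioning recipe sketched below; the function name dirichlet_partition, the default alpha, and the seed handling are illustrative assumptions rather than the paper's exact script.

import numpy as np


def dirichlet_partition(labels, num_clients, alpha=0.1, seed=0):
    """Assign sample indices to clients by drawing per-class client proportions
    from a Dirichlet(alpha) distribution (smaller alpha = more heterogeneous)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        # Split this class's samples at the cumulative proportion boundaries.
        cut_points = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client_id, shard in enumerate(np.split(idx, cut_points)):
            client_indices[client_id].extend(shard.tolist())
    return client_indices
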
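Finally, a hedged PyTorch sketch of the local client step described in the experiment setup: two local epochs of SGD with momentum 0.9 and weight decay 1e-4, with batch size 50 handled by the data loader. The function name local_update, the returned delta format, and the default learning rate (one value from the stated grid) are illustrative assumptions.

import torch


def local_update(model, loader, local_lr=0.01, local_epochs=2):
    """One client's local pass; `loader` is assumed to yield (inputs, labels) batches of size 50."""
    opt = torch.optim.SGD(model.parameters(), lr=local_lr,
                          momentum=0.9, weight_decay=1e-4)
    loss_fn = torch.nn.CrossEntropyLoss()
    start = [p.detach().clone() for p in model.parameters()]
    model.train()
    for _ in range(local_epochs):
        for inputs, targets in loader:
            opt.zero_grad()
            loss_fn(model(inputs), targets).backward()
            opt.step()
    # Return the parameter delta that the asynchronous server buffers and caches.
    return [p.detach() - s for p, s in zip(model.parameters(), start)]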