Delayed Gradient Averaging: Tolerate the Communication Latency for Federated Learning

Authors: Ligeng Zhu, Hongzhou Lin, Yao Lu, Yujun Lin, Song Han

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We theoretically prove that DGA attains a similar convergence rate as FedAvg, and empirically show that our algorithm can tolerate high network latency without compromising accuracy. Specifically, we benchmark the training speed on various vision (CIFAR, ImageNet) and language tasks (Shakespeare), with both IID and non-IID partitions, and show DGA can bring a 2.55× to 4.07× speedup. Moreover, we built a 16-node Raspberry Pi cluster and show that DGA can consistently speed up real-world federated learning applications.
Researcher Affiliation | Collaboration | Ligeng Zhu (MIT), Hongzhou Lin (Amazon), Yao Lu (Google), Yujun Lin (MIT), Song Han (MIT)
Pseudocode | Yes | Algorithm 1: Delayed Gradient Averaging (DGA). See the update-rule sketch after this table.
Open Source Code | No | The paper states 'We implement DGA in PyTorch framework [30] and choose Horovod [40] as the distributed training backend' but does not explicitly state that the DGA implementation is released, nor does it link to a code repository for the method.
Open Datasets | Yes | We evaluate the effectiveness of DGA on diverse tasks: image classification on CIFAR-10 [19] and ImageNet [20], next-word prediction on Shakespeare [41].
Dataset Splits | Yes | For non-i.i.d. experiments, we follow the partition used in [4, 22] to split the dataset. In CIFAR and ImageNet, we distribute the dataset such that each device only contains samples from two classes. In the Shakespeare dataset, each role is considered a data source and each device only has two sources. See the partition sketch after this table.
Hardware Specification | Yes | On CIFAR-10 [19], we train a MobileNetV2-0.25 [37] using 64 workers, each equipped with a single V100 GPU.
Software Dependencies | No | The paper mentions the 'PyTorch framework [30]' and 'Horovod [40]' but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | On CIFAR-10 [19], we train a MobileNetV2-0.25 [37] using 64 workers, each equipped with a single V100 GPU. Training runs for 200 epochs with a batch size of 64 per worker. The learning rate is initially set to 0.0125 × NUM_GPUs and momentum β is 0.9. The learning rate increases linearly during the first 5 epochs, following the warm-up strategy in [8], and then decays with a cosine annealing schedule. See the learning-rate schedule sketch after this table.
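To make the Pseudocode row concrete, below is a minimal single-process simulation of the delayed-averaging update as we read Algorithm 1: each worker applies its own gradient immediately, and once the all-reduce launched D steps earlier completes, it swaps that stale local gradient for the stale global average. The toy quadratic objective and the names dga_simulate, delay_d, and lr are illustrative assumptions, not the authors' implementation.

```python
"""Hedged sketch of the Delayed Gradient Averaging (DGA) update rule,
simulated in a single process with a buffer standing in for the
in-flight all-reduce that completes D steps later."""
from collections import deque
import numpy as np

def dga_simulate(num_workers=4, delay_d=2, lr=0.1, steps=50, dim=8, seed=0):
    rng = np.random.default_rng(seed)
    # Each worker holds its own replica and a per-worker quadratic target.
    targets = rng.normal(size=(num_workers, dim))
    weights = np.zeros((num_workers, dim))
    # Queue of (averaged gradient, per-worker local gradients) from past steps.
    in_flight = deque()

    for _ in range(steps):
        local_grads = weights - targets      # gradient of 0.5 * ||w - target||^2
        avg_grad = local_grads.mean(axis=0)  # what the all-reduce would return
        # Apply the *local* gradient immediately; do not wait for communication.
        weights -= lr * local_grads
        in_flight.append((avg_grad, local_grads.copy()))
        if len(in_flight) > delay_d:
            # The all-reduce launched D steps ago has now "arrived":
            # replace the stale local gradient with the stale global average.
            stale_avg, stale_locals = in_flight.popleft()
            weights -= lr * (stale_avg - stale_locals)

    # Drain remaining in-flight corrections so every replica is consistent.
    while in_flight:
        stale_avg, stale_locals = in_flight.popleft()
        weights -= lr * (stale_avg - stale_locals)
    return weights

if __name__ == "__main__":
    w = dga_simulate()
    print("max deviation across replicas:", np.abs(w - w.mean(axis=0)).max())
```

After the delayed corrections are applied, all replicas have performed identical updates, which is why the printed deviation is at floating-point level; only the most recent D uncorrected steps can differ across workers during training.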
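The Dataset Splits row describes a label-skewed split in which each device holds samples from only two classes. The sketch below is one way to realize such a split; the round-robin class assignment and the helper two_class_partition are assumptions and may differ from the exact partition used in [4, 22].

```python
"""Illustrative two-classes-per-device non-IID partition (an assumption,
not the paper's exact split code)."""
import numpy as np

def two_class_partition(labels, num_devices, classes_per_device=2, seed=0):
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    num_classes = int(labels.max()) + 1
    # Round-robin class assignment: device d is allotted classes
    # (classes_per_device * d + k) mod num_classes for k = 0..classes_per_device-1.
    assigned = [{(classes_per_device * d + k) % num_classes
                 for k in range(classes_per_device)} for d in range(num_devices)]
    parts = [[] for _ in range(num_devices)]
    for c in range(num_classes):
        devices = [d for d in range(num_devices) if c in assigned[d]]
        # Split this class's (shuffled) sample indices evenly among its devices.
        chunks = np.array_split(rng.permutation(np.where(labels == c)[0]), len(devices))
        for d, chunk in zip(devices, chunks):
            parts[d].append(chunk)
    return [np.concatenate(p) for p in parts]

# Toy check with CIFAR-10-like labels and 64 devices: every device sees 2 classes.
fake_labels = np.repeat(np.arange(10), 5000)
splits = two_class_partition(fake_labels, num_devices=64)
print(sorted({len(np.unique(fake_labels[s])) for s in splits}))
```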
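The Experiment Setup row determines the learning-rate schedule except for the warm-up starting value. A minimal sketch, assuming the warm-up ramps from near zero to the scaled peak 0.0125 × NUM_GPUs and the cosine anneal decays to zero over the remaining epochs:

```python
"""Hedged sketch of the quoted schedule: 5-epoch linear warm-up [8]
followed by cosine annealing over 200 epochs. The warm-up start value
is an assumption."""
import math

def lr_at(epoch, num_gpus=64, base_lr=0.0125, warmup_epochs=5, total_epochs=200):
    peak = base_lr * num_gpus
    if epoch < warmup_epochs:
        # Linear warm-up from ~0 to the scaled peak learning rate.
        return peak * (epoch + 1) / warmup_epochs
    # Cosine annealing from the peak down to zero over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))

print([round(lr_at(e), 4) for e in (0, 4, 5, 100, 199)])
```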