Delayed Gradient Averaging: Tolerate the Communication Latency for Federated Learning
Authors: Ligeng Zhu, Hongzhou Lin, Yao Lu, Yujun Lin, Song Han
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We theoretically prove that DGA attains a similar convergence rate as FedAvg, and empirically show that our algorithm can tolerate high network latency without compromising accuracy. Specifically, we benchmark the training speed on various vision (CIFAR, ImageNet) and language tasks (Shakespeare), with both IID and non-IID partitions, and show DGA can bring a 2.55× to 4.07× speedup. Moreover, we built a 16-node Raspberry Pi cluster and show that DGA can consistently speed up real-world federated learning applications. |
| Researcher Affiliation | Collaboration | Ligeng Zhu (MIT), Hongzhou Lin (Amazon), Yao Lu (Google), Yujun Lin (MIT), Song Han (MIT) |
| Pseudocode | Yes | Algorithm 1: Delayed Gradient Averaging (DGA); a hedged sketch of the update rule is given after the table. |
| Open Source Code | No | The paper states 'We implement DGA in PyTorch framework [30] and choose Horovod [40] as the distributed training backend' but does not include an explicit statement about releasing their own DGA implementation or provide a direct link to a code repository for their method. |
| Open Datasets | Yes | We evaluate the effectiveness of DGA on diverse tasks: image classification on CIFAR-10 [19] and ImageNet [20], and next-word prediction on Shakespeare [41]. |
| Dataset Splits | Yes | For non-i.i.d. experiments, we follow the partition used in [4, 22] to split the dataset. In CIFAR and ImageNet, we distribute the dataset such that each device only contains samples from two classes. In the Shakespeare dataset, each role is considered a data source and each device only has two sources. A sketch of such a two-class partition is given after the table. |
| Hardware Specification | Yes | On CIFAR-10 [19], we train a MobileNetV2-0.25 [37] using 64 workers, each equipped with a single V100 GPU. |
| Software Dependencies | No | The paper mentions the 'PyTorch framework [30]' and 'Horovod [40]' but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | On CIFAR-10 [19], we train a MobileNetV2-0.25 [37] using 64 workers, each equipped with a single V100 GPU. Training runs for 200 epochs with a batch size of 64 per worker. The learning rate is initially set to NUM_GPUs × 0.0125 and the momentum β is 0.9. The learning rate increases linearly during the first 5 epochs, following the warm-up strategy in [8], and then decays with a cosine annealing schedule; a sketch of this schedule is given after the table. |
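
The Pseudocode row refers to Algorithm 1 (DGA) in the paper. Below is a minimal single-process sketch of the delayed-averaging idea, assuming a fixed delay of `delay` steps and a user-supplied gradient callback; the names `dga_simulation` and `grad_fn` are illustrative and not taken from the paper. The core point is that each worker applies its fresh local gradient immediately and only `delay` steps later swaps the stale local contribution for the globally averaged gradient, which is how DGA hides all-reduce latency behind local computation.

```python
import numpy as np
from collections import deque


def dga_simulation(grad_fn, n_workers=4, steps=100, lr=0.1, delay=4, dim=10):
    """Single-process sketch of Delayed Gradient Averaging (DGA).

    grad_fn(step, worker_id, weights) -> np.ndarray gradient for that worker.
    Each worker applies its fresh local gradient right away; `delay` steps
    later the globally averaged gradient replaces the (now stale) local one,
    so the all-reduce latency is hidden behind local computation.
    """
    weights = [np.zeros(dim) for _ in range(n_workers)]  # per-worker model replicas
    in_flight = deque()  # gradients whose all-reduce result has not "arrived" yet

    for t in range(steps):
        grads = [grad_fn(t, i, weights[i]) for i in range(n_workers)]
        in_flight.append(grads)

        # Immediate local update with the fresh local gradient.
        for i in range(n_workers):
            weights[i] -= lr * grads[i]

        # The average of the gradients from step t - delay arrives now:
        # undo the stale local contribution and apply the global average.
        if len(in_flight) > delay:
            stale = in_flight.popleft()
            avg = np.mean(stale, axis=0)
            for i in range(n_workers):
                weights[i] += lr * stale[i]
                weights[i] -= lr * avg

    return weights
```

For example, `dga_simulation(lambda t, i, w: w - np.ones(10))` drives every replica toward the all-ones vector while still exercising the delayed-averaging bookkeeping.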
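
The two-classes-per-device split quoted in the Dataset Splits row can be reproduced with a simple class-pair partition. The sketch below assumes integer labels (as in CIFAR-10) and that `2 * n_devices` is at least the number of classes; the function name and the cyclic assignment scheme are illustrative, not the authors' exact partitioning code from [4, 22].

```python
import numpy as np


def two_class_partition(labels, n_devices, seed=0):
    """Sketch of a non-IID split where each device gets exactly two classes."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    assert 2 * n_devices >= len(classes), "sketch assumes every class fits on some device"
    rng = np.random.default_rng(seed)

    # Assign every device a pair of classes, cycling through the label set.
    device_classes = [(classes[(2 * d) % len(classes)],
                       classes[(2 * d + 1) % len(classes)])
                      for d in range(n_devices)]

    # Split each class's sample indices evenly among the devices that hold it.
    holders = {c: [d for d, pair in enumerate(device_classes) if c in pair]
               for c in classes}
    parts = [[] for _ in range(n_devices)]
    for c in classes:
        idx = rng.permutation(np.flatnonzero(labels == c))
        for chunk, d in zip(np.array_split(idx, len(holders[c])), holders[c]):
            parts[d].append(chunk)

    return [np.concatenate(p) for p in parts]
```

Each class is split evenly among the devices assigned to it, so every device ends up holding samples from exactly two labels.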
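
The learning-rate schedule in the Experiment Setup row (peak LR of NUM_GPUs × 0.0125, 5-epoch linear warm-up, then cosine annealing) can be written as a small helper. This is a sketch under the assumption that the cosine anneal decays toward zero by the last epoch; the function name is illustrative.

```python
import math


def warmup_cosine_lr(epoch, total_epochs=200, warmup_epochs=5,
                     base_lr=0.0125, num_gpus=64):
    """Sketch of the reported schedule: peak LR = 0.0125 * NUM_GPUs,
    linear warm-up over the first 5 epochs, then cosine annealing
    (assumed here to decay to zero by the final epoch)."""
    peak = base_lr * num_gpus
    if epoch < warmup_epochs:
        # Linear warm-up from (roughly) zero to the scaled peak LR.
        return peak * (epoch + 1) / warmup_epochs
    # Cosine annealing over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))
```

In a training loop this value would be written into the optimizer's parameter groups at the start of every epoch.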