Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Delayed Gradient Averaging: Tolerate the Communication Latency for Federated Learning
Authors: Ligeng Zhu, Hongzhou Lin, Yao Lu, Yujun Lin, Song Han
NeurIPS 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We theoretically prove that DGA attains a similar convergence rate as FedAvg, and empirically show that our algorithm can tolerate high network latency without compromising accuracy. Specifically, we benchmark the training speed on various vision (CIFAR, ImageNet) and language tasks (Shakespeare), with both IID and non-IID partitions, and show DGA can bring 2.55× to 4.07× speedup. Moreover, we built a 16-node Raspberry Pi cluster and show that DGA can consistently speed up real-world federated learning applications. |
| Researcher Affiliation | Collaboration | Ligeng Zhu (MIT), Hongzhou Lin (Amazon), Yao Lu (Google), Yujun Lin (MIT), Song Han (MIT) |
| Pseudocode | Yes | Algorithm 1 Delayed Gradient Averaging (DGA) |
| Open Source Code | No | The paper states 'We implement DGA in PyTorch framework [30] and choose Horovod [40] as the distributed training backend' but does not include an explicit statement about releasing their own DGA implementation code or provide a direct link to a code repository for their method. |
| Open Datasets | Yes | We evaluate the effectiveness of DGA on diverse tasks: image classification on CIFAR-10 [19] and ImageNet [20], next-word prediction on Shakespeare [41]. |
| Dataset Splits | Yes | For non-i.i.d. experiments, we follow the partition used in [4,22] to split the dataset. In CIFAR and ImageNet, we distribute the dataset such that each device only contains samples from two classes. In the Shakespeare dataset, each role is considered as a data source and each device only has two sources. |
| Hardware Specification | Yes | On CIFAR-10 [19], we train a MobileNetV2-0.25 [37] using 64 workers, each equipped with a single V100 GPU. |
| Software Dependencies | No | The paper mentions 'PyTorch framework [30]' and 'Horovod [40]' but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | On CIFAR-10 [19], we train a MobileNetV2-0.25 [37] using 64 workers, each equipped with a single V100 GPU. Training runs for 200 epochs with a batch size of 64 per worker. The learning rate is initially set to NUM_GPUs × 0.0125 and the momentum β is 0.9. The learning rate linearly increases during the first 5 epochs, following the warm-up strategy in [8], and then decays with a cosine annealing schedule. |
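The two-classes-per-device split reported above can be approximated with the standard shard-based non-IID partition from the federated-learning literature. The sketch below is illustrative, not the paper's code; the function name and parameters are assumptions. When shard boundaries align with class boundaries, each device receives samples from at most two classes.

```python
import numpy as np

def shard_partition(labels, num_devices, shards_per_device=2, seed=0):
    """Shard-style non-IID split (hypothetical helper): sort samples by
    label, cut into equal contiguous shards, and deal each device
    `shards_per_device` shards at random. With shards aligned to class
    boundaries, each device holds samples from at most two classes."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    order = np.argsort(labels, kind="stable")  # group indices by class
    shards = np.array_split(order, num_devices * shards_per_device)
    deal = rng.permutation(len(shards))
    return [
        np.concatenate([shards[s] for s in
                        deal[d * shards_per_device:(d + 1) * shards_per_device]])
        for d in range(num_devices)
    ]
```

For CIFAR-10 with 64 devices this yields 128 shards of roughly 390 samples each; shards there do not land exactly on class boundaries, so a few devices may touch a third class unless shard sizes are chosen to divide the per-class counts.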
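The learning-rate schedule described in the setup (linear warm-up for 5 epochs to NUM_GPUs × 0.0125, then cosine annealing) can be sketched as follows. The base rate assumes 64 workers with one GPU each, as reported; the decay-to-zero endpoint is an assumption, since the quote gives only the schedule shape.

```python
import math

def lr_at_epoch(epoch, total_epochs=200, warmup_epochs=5,
                base_lr=64 * 0.0125):
    """Per-epoch learning rate: linear warm-up over the first
    `warmup_epochs`, then cosine annealing toward zero (assumed
    endpoint) over the remaining epochs."""
    if epoch < warmup_epochs:
        # Linear ramp from base_lr/warmup_epochs up to base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

With the reported 64 GPUs, the peak rate is 64 × 0.0125 = 0.8, reached at the end of warm-up.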