Delayed Gradient Averaging: Tolerate the Communication Latency for Federated Learning

Authors: Ligeng Zhu, Hongzhou Lin, Yao Lu, Yujun Lin, Song Han

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We theoretically prove that DGA attains a similar convergence rate as FedAvg, and empirically show that our algorithm can tolerate high network latency without compromising accuracy. Specifically, we benchmark the training speed on various vision (CIFAR, ImageNet) and language tasks (Shakespeare), with both IID and non-IID partitions, and show DGA can bring a 2.55× to 4.07× speedup. Moreover, we built a 16-node Raspberry Pi cluster and show that DGA can consistently speed up real-world federated learning applications.
Researcher Affiliation | Collaboration | Ligeng Zhu (MIT), Hongzhou Lin (Amazon), Yao Lu (Google), Yujun Lin (MIT), Song Han (MIT)
Pseudocode | Yes | Algorithm 1: Delayed Gradient Averaging (DGA). See the update-rule sketch after this table.
Open Source Code | No | The paper states 'We implement DGA in PyTorch framework [30] and choose Horovod [40] as the distributed training backend' but does not explicitly state that the DGA implementation is released, nor does it link to a code repository for the method.
Open Datasets | Yes | We evaluate the effectiveness of DGA on diverse tasks: image classification on CIFAR-10 [19] and ImageNet [20], next-word prediction on Shakespeare [41].
Dataset Splits | Yes | For non-i.i.d. experiments, we follow the partition used in [4, 22] to split the dataset. In CIFAR and ImageNet, we distribute the dataset such that each device only contains samples from two classes. In the Shakespeare dataset, each role is considered a data source and each device only has two sources. See the partition sketch after this table.
Hardware Specification | Yes | On CIFAR-10 [19], we train a MobileNetV2-0.25 [37] using 64 workers, each equipped with a single V100 GPU.
Software Dependencies | No | The paper mentions the 'PyTorch framework [30]' and 'Horovod [40]' but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | On CIFAR-10 [19], we train a MobileNetV2-0.25 [37] using 64 workers, each equipped with a single V100 GPU. Training runs for 200 epochs with a batch size of 64 per worker. The learning rate is initially set to 0.0125 × NUM_GPUs and momentum β is 0.9. The learning rate increases linearly during the first 5 epochs, following the warm-up strategy in [8], and then decays with a cosine annealing schedule. See the learning-rate schedule sketch after this table.
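To make the Pseudocode row concrete, below is a minimal single-process simulation of the delayed-averaging update as we read Algorithm 1: each worker applies its own gradient immediately, and once the all-reduce launched D steps earlier completes, it swaps that stale local gradient for the stale global average. The toy quadratic objective and the names dga_simulate, delay_d, and lr are illustrative assumptions, not the authors' implementation.

```python
"""Hedged sketch of the Delayed Gradient Averaging (DGA) update rule,
simulated in a single process with a buffer standing in for the
in-flight all-reduce that completes D steps later."""
from collections import deque
import numpy as np

def dga_simulate(num_workers=4, delay_d=2, lr=0.1, steps=50, dim=8, seed=0):
    rng = np.random.default_rng(seed)
    # Each worker holds its own replica and a per-worker quadratic target.
    targets = rng.normal(size=(num_workers, dim))
    weights = np.zeros((num_workers, dim))
    # Queue of (averaged gradient, per-worker local gradients) from past steps.
    in_flight = deque()

    for _ in range(steps):
        local_grads = weights - targets      # gradient of 0.5 * ||w - target||^2
        avg_grad = local_grads.mean(axis=0)  # what the all-reduce would return
        # Apply the *local* gradient immediately; do not wait for communication.
        weights -= lr * local_grads
        in_flight.append((avg_grad, local_grads.copy()))
        if len(in_flight) > delay_d:
            # The all-reduce launched D steps ago has now "arrived":
            # replace the stale local gradient with the stale global average.
            stale_avg, stale_locals = in_flight.popleft()
            weights -= lr * (stale_avg - stale_locals)

    # Drain remaining in-flight corrections so every replica is consistent.
    while in_flight:
        stale_avg, stale_locals = in_flight.popleft()
        weights -= lr * (stale_avg - stale_locals)
    return weights

if __name__ == "__main__":
    w = dga_simulate()
    print("max deviation across replicas:", np.abs(w - w.mean(axis=0)).max())
```

After the delayed corrections are applied, all replicas have performed identical updates, which is why the printed deviation is at floating-point level; only the most recent D uncorrected steps can differ across workers during training.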
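The Dataset Splits row describes a label-skewed split in which each device holds samples from only two classes. The sketch below is one way to realize such a split; the round-robin class assignment and the helper two_class_partition are assumptions and may differ from the exact partition used in [4, 22].

```python
"""Illustrative two-classes-per-device non-IID partition (an assumption,
not the paper's exact split code)."""
import numpy as np

def two_class_partition(labels, num_devices, classes_per_device=2, seed=0):
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    num_classes = int(labels.max()) + 1
    # Round-robin class assignment: device d is allotted classes
    # (classes_per_device * d + k) mod num_classes for k = 0..classes_per_device-1.
    assigned = [{(classes_per_device * d + k) % num_classes
                 for k in range(classes_per_device)} for d in range(num_devices)]
    parts = [[] for _ in range(num_devices)]
    for c in range(num_classes):
        devices = [d for d in range(num_devices) if c in assigned[d]]
        # Split this class's (shuffled) sample indices evenly among its devices.
        chunks = np.array_split(rng.permutation(np.where(labels == c)[0]), len(devices))
        for d, chunk in zip(devices, chunks):
            parts[d].append(chunk)
    return [np.concatenate(p) for p in parts]

# Toy check with CIFAR-10-like labels and 64 devices: every device sees 2 classes.
fake_labels = np.repeat(np.arange(10), 5000)
splits = two_class_partition(fake_labels, num_devices=64)
print(sorted({len(np.unique(fake_labels[s])) for s in splits}))
```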
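The Experiment Setup row determines the learning-rate schedule except for the warm-up starting value. A minimal sketch, assuming the warm-up ramps from near zero to the scaled peak 0.0125 × NUM_GPUs and the cosine anneal decays to zero over the remaining epochs:

```python
"""Hedged sketch of the quoted schedule: 5-epoch linear warm-up [8]
followed by cosine annealing over 200 epochs. The warm-up start value
is an assumption."""
import math

def lr_at(epoch, num_gpus=64, base_lr=0.0125, warmup_epochs=5, total_epochs=200):
    peak = base_lr * num_gpus
    if epoch < warmup_epochs:
        # Linear warm-up from ~0 to the scaled peak learning rate.
        return peak * (epoch + 1) / warmup_epochs
    # Cosine annealing from the peak down to zero over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))

print([round(lr_at(e), 4) for e in (0, 4, 5, 100, 199)])
```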