Quasi-Global Momentum: Accelerating Decentralized Deep Learning on Heterogeneous Data

Authors: Tao Lin, Sai Praneeth Karimireddy, Sebastian Stich, Martin Jaggi

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show in extensive empirical experiments on various CV/NLP datasets (CIFAR-10, ImageNet, and AG News) and several network topologies (ring and social network) that our method is much more robust to the heterogeneity of clients' data than other existing methods, by a significant improvement in test performance (1%–20%).
Researcher Affiliation | Academia | EPFL, Lausanne, Switzerland.
Pseudocode | Yes | Algorithm 1 highlights the difference between DSGDm and QG-DSGDm. Instead of using local gradients from heterogeneous data to form the local momentum (line 4 for DSGDm), which may significantly deflect from the global optimization direction, for QG-DSGDm, we use the differ[ence] ... (Caption of Algorithm 1: Decentralized learning algorithms, QG-DSGDm vs. DSGDm; colors indicate the two alternative algorithm variants. At initialization m_i^(0) = m̂_i^(0) := 0.)
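For readers without access to the algorithm box, below is a minimal NumPy sketch of my reading of the two variants for one communication round: DSGDm folds the local stochastic gradient into a per-worker heavy-ball momentum buffer, while QG-DSGDm keeps a quasi-global buffer that is refreshed from the difference of consecutive gossip-synchronized models (normalized by the learning rate). The ring mixing matrix, the (1 − β) damping on the buffer update, and all function names are my assumptions, not the paper's exact pseudocode.

```python
import numpy as np

def ring_mixing_matrix(n):
    """Doubly-stochastic gossip weights for a ring: self + two neighbors, 1/3 each (assumed topology)."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0
    return W

def dsgdm_round(x, m, grads, W, lr=0.1, beta=0.9):
    """One round of DSGDm: local heavy-ball momentum from local gradients, then gossip averaging."""
    m = beta * m + grads                       # momentum built from (heterogeneous) local gradients
    x_half = x - lr * m                        # local SGD-with-momentum step
    return W @ x_half, m                       # gossip: each worker averages neighbors' models

def qg_dsgdm_round(x, m_hat, grads, W, lr=0.1, beta=0.9):
    """One round of QG-DSGDm (my reading): the buffer tracks a quasi-global update direction."""
    x_half = x - lr * (grads + beta * m_hat)   # local step steered by the quasi-global buffer
    x_new = W @ x_half                         # gossip averaging with neighbors
    d = (x - x_new) / lr                       # difference of consecutive synchronized models
    m_hat = beta * m_hat + (1.0 - beta) * d    # buffer refreshed from the model difference
    return x_new, m_hat

# Toy usage: 8 workers on a ring, 10-dimensional model, random stand-in "gradients".
n, dim = 8, 10
x = np.zeros((n, dim)); m = np.zeros((n, dim))   # buffers start at zero, matching the caption
W = ring_mixing_matrix(n)
x_d, m_d = dsgdm_round(x, m, np.random.randn(n, dim), W)
x_q, m_q = qg_dsgdm_round(x, m, np.random.randn(n, dim), W)
```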
Open Source Code | Yes | Our code is publicly available. Code: github.com/epfml/quasi-global-momentum
Open Datasets | Yes | Image classification (CV) benchmark: we consider training CIFAR-10 (Krizhevsky & Hinton, 2009), ImageNet-32 (i.e. image resolution of 32) (Chrabaszcz et al., 2017), and ImageNet (Deng et al., 2009) from scratch, with standard data augmentation and preprocessing scheme (He et al., 2016). ... Text classification (NLP) benchmark: we perform fine-tuning on a 4-class classification dataset (AG News (Zhang et al., 2015)).
Dataset Splits | No | We report the averaged performance of local models on the full test dataset. The models are trained for 300 and 90 epochs for CIFAR-10 and ImageNet(-32) respectively; the local mini-batch sizes are set to 32 and 64. The test top-1 accuracy results in the table are averaged over three random seeds, with learning rate tuning for each setting. The paper describes generating non-i.i.d. client data using a Dirichlet distribution, but does not specify explicit train/validation/test splits by percentages or sample counts.
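The Dirichlet-based split referred to in the entry above is the now-standard label-skew partitioning: for each class, a Dirichlet(α) vector over clients decides what share of that class each client receives, so small α yields highly non-i.i.d. local datasets. A minimal sketch under that assumption follows; the function name and tie-breaking details are mine and may differ from the authors' released code.

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Split sample indices across clients with per-class Dirichlet(alpha) proportions (assumed scheme)."""
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # Share of class c assigned to each client; smaller alpha => stronger skew.
        proportions = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, shard in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(shard.tolist())
    return client_indices

# Example: 10 classes, 16 clients; alpha = 0.1 gives heavy skew, alpha = 100 is near-i.i.d.
labels = np.random.randint(0, 10, size=50_000)
parts = dirichlet_partition(labels, n_clients=16, alpha=0.1)
print([len(p) for p in parts])
```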
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies | No | The paper mentions software components like Hugging Face (Wolf et al., 2019), implying a Python environment, but does not provide specific version numbers for any software, libraries, or dependencies.
Experiment Setup | Yes | For the CV benchmark, the models are trained for 300 and 90 epochs for CIFAR-10 and ImageNet(-32) respectively; the local mini-batch sizes are set to 32 and 64. All experiments use the SOTA learning rate scheme in distributed deep learning training (Goyal et al., 2017; He et al., 2019) with learning rate scaling and warm-up. The learning rate is always gradually warmed up from a relatively small value (i.e. 0.1) for the first 5 epochs. Besides, the learning rate will be divided by 10 when the model has accessed specified fractions of the total number of training samples ({1/4} for CIFAR and {1/9} for ImageNet). For the NLP benchmark, we fine-tune distilbert-base-uncased from Hugging Face (Wolf et al., 2019) with a constant learning rate and mini-batches of size 32 for 10 epochs. ... We use constant weight decay (1e-4). Regarding momentum-related hyper-parameters, we follow the common practice in the community (β = 0.9 and without dampening for Nesterov/heavy-ball momentum variants, and β1 = 0.9, β2 = 0.99 for Adam variants). ... We use the Dirichlet distribution to create disjoint non-i.i.d. client training data... The degree of non-i.i.d.-ness is controlled by the value of α.
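The quoted schedule (linear warm-up from 0.1 over the first 5 epochs, then division by 10 once given fractions of the training samples have been seen) is easy to mis-set when reproducing, so here is a small sketch. The decay milestones are left as a parameter rather than hard-coded, and the example milestone values below are hypothetical placeholders; how the peak learning rate is scaled with the global batch size (Goyal et al., 2017) is also my assumption, not a detail stated in the quote.

```python
def learning_rate(epoch, peak_lr, total_epochs, warmup_epochs=5,
                  warmup_start=0.1, decay_fractions=(), decay_factor=0.1):
    """Warm-up-then-step-decay schedule (sketch of the quoted setup, not the authors' exact code).

    decay_fractions: fractions of total training progress at which the LR is divided by 10;
    pass the milestones in explicitly rather than relying on the values in this sketch.
    """
    if epoch < warmup_epochs:
        # Linear warm-up from warmup_start to the (possibly batch-size-scaled) peak LR.
        return warmup_start + (peak_lr - warmup_start) * epoch / warmup_epochs
    progress = epoch / total_epochs
    n_decays = sum(progress >= f for f in decay_fractions)
    return peak_lr * (decay_factor ** n_decays)

# Example with hypothetical milestones: a 300-epoch CIFAR-10 run, decay at 50% and 75% of training.
lrs = [learning_rate(e, peak_lr=0.8, total_epochs=300, decay_fractions=(0.5, 0.75))
       for e in range(300)]
```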