Quasi-Global Momentum: Accelerating Decentralized Deep Learning on Heterogeneous Data

Authors: Tao Lin, Sai Praneeth Karimireddy, Sebastian Stich, Martin Jaggi

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show in extensive empirical experiments on various CV/NLP datasets (CIFAR-10, ImageNet, and AG News) and several network topologies (ring and social network) that our method is much more robust to the heterogeneity of clients' data than other existing methods, by a significant improvement in test performance (1%–20%).
Researcher Affiliation | Academia | EPFL, Lausanne, Switzerland.
Pseudocode | Yes | Algorithm 1 highlights the difference between DSGDm and QG-DSGDm. Instead of using local gradients from heterogeneous data to form the local momentum (line 4 for DSGDm), which may significantly deflect from the global optimization direction, for QG-DSGDm, we use the differ[ence] ... (Caption of Algorithm 1: Decentralized learning algorithms, QG-DSGDm vs. DSGDm; colors indicate the two alternative algorithm variants. At initialization m_i^(0) = m̂_i^(0) := 0.)
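For readers without access to the algorithm box, below is a minimal NumPy sketch of my reading of the two variants for one communication round: DSGDm folds the local stochastic gradient into a per-worker heavy-ball momentum buffer, while QG-DSGDm keeps a quasi-global buffer that is refreshed from the difference of consecutive gossip-synchronized models (normalized by the learning rate). The ring mixing matrix, the (1 − β) damping on the buffer update, and all function names are my assumptions, not the paper's exact pseudocode.

```python
import numpy as np

def ring_mixing_matrix(n):
    """Doubly-stochastic gossip weights for a ring: self + two neighbors, 1/3 each (assumed topology)."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0
    return W

def dsgdm_round(x, m, grads, W, lr=0.1, beta=0.9):
    """One round of DSGDm: local heavy-ball momentum from local gradients, then gossip averaging."""
    m = beta * m + grads                       # momentum built from (heterogeneous) local gradients
    x_half = x - lr * m                        # local SGD-with-momentum step
    return W @ x_half, m                       # gossip: each worker averages neighbors' models

def qg_dsgdm_round(x, m_hat, grads, W, lr=0.1, beta=0.9):
    """One round of QG-DSGDm (my reading): the buffer tracks a quasi-global update direction."""
    x_half = x - lr * (grads + beta * m_hat)   # local step steered by the quasi-global buffer
    x_new = W @ x_half                         # gossip averaging with neighbors
    d = (x - x_new) / lr                       # difference of consecutive synchronized models
    m_hat = beta * m_hat + (1.0 - beta) * d    # buffer refreshed from the model difference
    return x_new, m_hat

# Toy usage: 8 workers on a ring, 10-dimensional model, random stand-in "gradients".
n, dim = 8, 10
x = np.zeros((n, dim)); m = np.zeros((n, dim))   # buffers start at zero, matching the caption
W = ring_mixing_matrix(n)
x_d, m_d = dsgdm_round(x, m, np.random.randn(n, dim), W)
x_q, m_q = qg_dsgdm_round(x, m, np.random.randn(n, dim), W)
```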
Open Source Code | Yes | Our code is publicly available. Code: github.com/epfml/quasi-global-momentum
Open Datasets | Yes | Image classification (CV) benchmark: we consider training CIFAR-10 (Krizhevsky & Hinton, 2009), ImageNet-32 (i.e. image resolution of 32) (Chrabaszcz et al., 2017), and ImageNet (Deng et al., 2009) from scratch, with standard data augmentation and preprocessing scheme (He et al., 2016). ... Text classification (NLP) benchmark: we perform fine-tuning on a 4-class classification dataset (AG News (Zhang et al., 2015)).
Dataset Splits | No | We report the averaged performance of local models on the full test dataset. The models are trained for 300 and 90 epochs for CIFAR-10 and ImageNet(-32) respectively; the local mini-batch sizes are set to 32 and 64. The test top-1 accuracy results in the table are averaged over three random seeds, with learning rate tuning for each setting. The paper describes generating non-i.i.d. client data using a Dirichlet distribution, but does not specify explicit train/validation/test splits by percentages or sample counts.
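The Dirichlet-based split referred to in the entry above is the now-standard label-skew partitioning: for each class, a Dirichlet(α) vector over clients decides what share of that class each client receives, so small α yields highly non-i.i.d. local datasets. A minimal sketch under that assumption follows; the function name and tie-breaking details are mine and may differ from the authors' released code.

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Split sample indices across clients with per-class Dirichlet(alpha) proportions (assumed scheme)."""
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # Share of class c assigned to each client; smaller alpha => stronger skew.
        proportions = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, shard in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(shard.tolist())
    return client_indices

# Example: 10 classes, 16 clients; alpha = 0.1 gives heavy skew, alpha = 100 is near-i.i.d.
labels = np.random.randint(0, 10, size=50_000)
parts = dirichlet_partition(labels, n_clients=16, alpha=0.1)
print([len(p) for p in parts])
```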
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies | No | The paper mentions software components like Hugging Face (Wolf et al., 2019), implying a Python environment, but does not provide specific version numbers for any software, libraries, or dependencies.
Experiment Setup | Yes | For the CV benchmark, the models are trained for 300 and 90 epochs for CIFAR-10 and ImageNet(-32) respectively; the local mini-batch sizes are set to 32 and 64. All experiments use the SOTA learning rate scheme in distributed deep learning training (Goyal et al., 2017; He et al., 2019) with learning rate scaling and warm-up. The learning rate is always gradually warmed up from a relatively small value (i.e. 0.1) for the first 5 epochs. Besides, the learning rate will be divided by 10 when the model has accessed specified fractions of the total number of training samples ({1/4} for CIFAR and {1/9} for ImageNet). For the NLP benchmark, we fine-tune distilbert-base-uncased from Hugging Face (Wolf et al., 2019) with a constant learning rate and mini-batches of size 32 for 10 epochs. ... We use constant weight decay (1e-4). Regarding momentum-related hyper-parameters, we follow the common practice in the community (β = 0.9 and without dampening for Nesterov/heavy-ball momentum variants, and β1 = 0.9, β2 = 0.99 for Adam variants). ... We use the Dirichlet distribution to create disjoint non-i.i.d. client training data... The degree of non-i.i.d.-ness is controlled by the value of α.
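The quoted schedule (linear warm-up from 0.1 over the first 5 epochs, then division by 10 once given fractions of the training samples have been seen) is easy to mis-set when reproducing, so here is a small sketch. The decay milestones are left as a parameter rather than hard-coded, and the example milestone values below are hypothetical placeholders; how the peak learning rate is scaled with the global batch size (Goyal et al., 2017) is also my assumption, not a detail stated in the quote.

```python
def learning_rate(epoch, peak_lr, total_epochs, warmup_epochs=5,
                  warmup_start=0.1, decay_fractions=(), decay_factor=0.1):
    """Warm-up-then-step-decay schedule (sketch of the quoted setup, not the authors' exact code).

    decay_fractions: fractions of total training progress at which the LR is divided by 10;
    pass the milestones in explicitly rather than relying on the values in this sketch.
    """
    if epoch < warmup_epochs:
        # Linear warm-up from warmup_start to the (possibly batch-size-scaled) peak LR.
        return warmup_start + (peak_lr - warmup_start) * epoch / warmup_epochs
    progress = epoch / total_epochs
    n_decays = sum(progress >= f for f in decay_fractions)
    return peak_lr * (decay_factor ** n_decays)

# Example with hypothetical milestones: a 300-epoch CIFAR-10 run, decay at 50% and 75% of training.
lrs = [learning_rate(e, peak_lr=0.8, total_epochs=300, decay_fractions=(0.5, 0.75))
       for e in range(300)]
```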