Revisiting Optimal Convergence Rate for Smooth and Non-convex Stochastic Decentralized Optimization

Authors: Kun Yuan, Xinmeng Huang, Yiming Chen, Xiaohan Zhang, Yingya Zhang, Pan Pan

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This section will validate our theoretical results by empirically comparing different decentralized algorithms DSGD [39], D2 [67], DSGT [82], DeTAG [44] and MG-DSGD in deep learning.
Researcher Affiliation | Collaboration | 1 DAMO Academy, Alibaba Group; 2 University of Pennsylvania; 3 Peking University; 4 Meta Carbon
Pseudocode | Yes | Algorithm 1: Decentralized SGD with multiple gossip steps (MG-DSGD); Algorithm 2: x_i = Fast Gossip Average({ϕ_i}_{i=1}^n, W, R) (an illustrative sketch of the multi-gossip update appears below the table)
Open Source Code | No | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No]
Open Datasets | Yes | A series of experiments are carried out with CIFAR-10 [34] and ImageNet [16] to compare the aforementioned methods.
Dataset Splits | Yes | The CIFAR-10 dataset consists of 50,000 training images and 10,000 validation images in 10 classes; the ImageNet dataset consists of 1,281,167 training images and 50,000 validation images in 1,000 classes.
Hardware Specification | Yes | All the models and training scripts in this section run on servers with 8 NVIDIA V100 GPUs, with each GPU treated as one node.
Software Dependencies | Yes | We implement all decentralized algorithms with PyTorch [53] 1.6.0, using NCCL 2.8.3 (CUDA 10.1) as the communication backend. (A minimal backend-initialization sketch appears below the table.)
Experiment Setup | Yes | We train for a total of 300 epochs; the learning rate is warmed up over the first 5 epochs and decayed by a factor of 10 at the 150-th and 250-th epochs. For the learning rate, we tuned a strong baseline in the PSGD setting (5e-3 for a single node) and used the same setting for all decentralized methods. The batch size is set to 128 on each node. (See the schedule sketch below the table.)
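
To make the pseudocode row concrete, below is a minimal NumPy simulation of the multi-gossip pattern behind MG-DSGD: each node takes a local stochastic-gradient step, and the nodes then run R gossip rounds with a doubly stochastic mixing matrix W. The ring topology, the quadratic local losses, the hyperparameters, and the use of plain gossip in place of the paper's accelerated Fast Gossip Average are all illustrative assumptions, not the authors' Algorithms 1-2.

    # Minimal simulation of decentralized SGD with multiple gossip steps per
    # iteration (the MG-DSGD pattern): a local stochastic-gradient step on
    # every node, followed by R averaging rounds with a mixing matrix W.
    import numpy as np

    rng = np.random.default_rng(0)
    n, d, R, lr, steps = 8, 10, 3, 0.05, 200   # nodes, dim, gossip rounds, step size, iterations

    # Ring-topology mixing matrix: each node averages with its two neighbors.
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = 0.5
        W[i, (i - 1) % n] = 0.25
        W[i, (i + 1) % n] = 0.25

    # Heterogeneous local least-squares problems: node i holds (A_i, b_i).
    A = rng.standard_normal((n, 50, d))
    b = rng.standard_normal((n, 50))

    x = np.zeros((n, d))                       # one model copy per node
    for _ in range(steps):
        # Local stochastic-gradient step on a random minibatch per node.
        for i in range(n):
            idx = rng.choice(50, size=8, replace=False)
            grad = A[i, idx].T @ (A[i, idx] @ x[i] - b[i, idx]) / len(idx)
            x[i] = x[i] - lr * grad
        # R gossip (communication) rounds drive the copies toward consensus.
        for _ in range(R):
            x = W @ x

    print("consensus gap:", np.linalg.norm(x - x.mean(axis=0)))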
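
The software-dependency row names PyTorch with NCCL as the communication backend. A typical way to initialize this when each of the 8 GPUs is treated as one node is sketched below; the environment-variable based launch and the all_reduce helper are illustrative assumptions, not the authors' implementation (their decentralized communication is a gossip step rather than an exact all-reduce).

    # One process per GPU, with RANK, WORLD_SIZE, MASTER_ADDR/PORT and
    # LOCAL_RANK assumed to be provided by a launcher such as
    # torch.distributed.launch / torchrun.
    import os
    import torch
    import torch.distributed as dist

    def init_worker():
        local_rank = int(os.environ.get("LOCAL_RANK", 0))
        torch.cuda.set_device(local_rank)          # pin this process to one V100
        dist.init_process_group(backend="nccl",    # NCCL backend, as stated in the row
                                init_method="env://")
        return local_rank

    # For reference: an all_reduce computes the exact network-wide average,
    # which R gossip rounds only approximate in the decentralized setting.
    def exact_average_(tensor):
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
        tensor /= dist.get_world_size()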
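
The training protocol in the experiment-setup row (5 warmup epochs, decay by 10 at epochs 150 and 250, 300 epochs total, base learning rate 5e-3 per node) can be written as a standard PyTorch scheduler. The linear warmup shape and the placeholder model below are assumptions for illustration, not details taken from the paper.

    # Warm up over the first 5 epochs, then divide the learning rate by 10
    # at epochs 150 and 250 of a 300-epoch run.
    import torch

    def lr_multiplier(epoch, warmup_epochs=5, milestones=(150, 250)):
        if epoch < warmup_epochs:
            return (epoch + 1) / warmup_epochs                  # assumed linear warmup
        return 0.1 ** sum(epoch >= m for m in milestones)       # step decay by 10x

    model = torch.nn.Linear(10, 10)                              # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=5e-3)     # 5e-3 per node, as quoted
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_multiplier)

    for epoch in range(300):
        # ... one epoch of training with per-node batch size 128 goes here ...
        scheduler.step()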