Asynchronous Decentralized SGD with Quantized and Local Updates
Authors: Giorgi Nadiradze, Amirmojtaba Sabour, Peter Davies, Shigang Li, Dan Alistarh
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply Swarm SGD to train deep neural networks on image classification and machine translation (NMT) tasks, deployed on the Piz Daint supercomputer [Piz, 2019]. Experiments confirm the intuition that the average synchronization cost of Swarm SGD per iteration is low: it stays at less than 10% of the batch computation time, and remains constant as we increase the number of nodes. For example, using Swarm SGD, we are able to train a Transformer-XL [Vaswani et al., 2017] model on WMT17 (En-Ge) 1.5× faster than a highly-optimized large-batch SGD baseline, and to slightly higher accuracy, without additional hyper-parameter tuning. |
| Researcher Affiliation | Collaboration | Giorgi Nadiradze, IST Austria, giorgi.nadiradze@ist.ac.at; Amirmojtaba Sabour, IST Austria, amsabour79@gmail.com; Peter Davies, University of Surrey, pd0034@surrey.ac.uk; Shigang Li, ETH Zurich, shigangli.cs@gmail.com; Dan Alistarh, IST Austria & Neural Magic, dan.alistarh@ist.ac.at |
| Pseudocode | Yes | Algorithm 1: Sequential Swarm SGD pseudocode for each interaction between nodes i and j. A hedged sketch of this pairwise interaction is given below the table. |
| Open Source Code | Yes | (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] |
| Open Datasets | Yes | We apply Swarm SGD to train deep neural networks on image classification and machine translation (NMT) tasks, deployed on the Piz Daint supercomputer [Piz, 2019]... train ResNets on the CIFAR-10/100 [Krizhevsky et al., 2014] and ImageNet [Russakovsky et al., 2015] datasets, while we use the TensorFlow implementation to train the original version of the Transformer-XL model [Vaswani et al., 2017] on the WMT17 (En-Ge) dataset. |
| Dataset Splits | No | The paper mentions 'approximate Top-1 validation accuracy recovery' and refers to standard datasets such as CIFAR-10 and ImageNet, which have common splits. It also states 'partition it among processes'. However, it does not explicitly provide specific percentages, sample counts, or direct citations for the train/validation/test splits used for reproduction. |
| Hardware Specification | Yes | We run Swarm SGD on the CSCS Piz Daint supercomputer, which is composed of Cray XC50 nodes, each with a Xeon E5-2690v3 CPU and an NVIDIA Tesla P100 GPU, using a state-of-the-art Aries interconnect over a Dragonfly network topology, which is regular. |
| Software Dependencies | No | The paper states 'We implemented Swarm SGD in PyTorch and TensorFlow using MPI-based primitives' but does not specify version numbers for these software dependencies. |
| Experiment Setup | Yes | Our training methodology follows data-parallel training, with some differences due to decentralization, and is identical to previous work on decentralized and local SGD, e.g. Lian et al. [2017], Assran et al. [2018], Lin et al. [2018]... local batch sizes are 128 for ResNet20 and ResNet50, and 128 for ResNet18. (Quantization is not applied in these experiments.)... Swarm step count represents local SGD steps per model between two averaging steps, and epochs are counted in terms of total passes over the data by all nodes. ...the learning rate schedule, momentum and weight decay terms are identical to the standard values for sequential SGD, for each individual model. An illustrative configuration sketch collecting these settings follows the table. |
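The Pseudocode row quotes Algorithm 1, the sequential Swarm SGD interaction in which two nodes each take local SGD steps and then both adopt the average of their (quantized) models. The NumPy sketch below is a minimal, hypothetical illustration of that interaction on flat parameter vectors; the function names, the toy stochastic quantizer, and all hyperparameters are assumptions for illustration, not the authors' released implementation.

```python
import numpy as np

def local_sgd_steps(x, sample_grad, lr, num_local_steps):
    """Run a few local SGD steps on one node's model copy."""
    for _ in range(num_local_steps):
        x = x - lr * sample_grad(x)  # stochastic gradient at the current local model
    return x

def quantize(x, levels=256):
    """Toy stochastic uniform quantizer standing in for the paper's quantization operator."""
    scale = np.max(np.abs(x)) + 1e-12
    y = x / scale * (levels // 2)
    low = np.floor(y)
    rounded = low + (np.random.rand(*x.shape) < (y - low))  # unbiased stochastic rounding
    return rounded / (levels // 2) * scale

def swarm_interaction(x_i, x_j, sample_grad, lr, num_local_steps):
    """One pairwise interaction between nodes i and j: each node takes local
    SGD steps, then both adopt the average of the quantized models."""
    x_i = local_sgd_steps(x_i, sample_grad, lr, num_local_steps)
    x_j = local_sgd_steps(x_j, sample_grad, lr, num_local_steps)
    avg = 0.5 * (quantize(x_i) + quantize(x_j))
    return avg.copy(), avg.copy()

# Tiny usage example: two nodes minimizing a noisy quadratic.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grad = lambda x: 2.0 * x + 0.1 * rng.standard_normal(x.shape)
    x_i, x_j = rng.standard_normal(5), rng.standard_normal(5)
    for _ in range(200):
        x_i, x_j = swarm_interaction(x_i, x_j, grad, lr=0.05, num_local_steps=2)
    print(np.linalg.norm(x_i))  # should shrink toward the noise/quantization floor
```

In the actual system the interaction is asynchronous and pairwise over a communication graph; the sequential loop above only mimics the averaging-after-local-steps structure described in the quoted pseudocode.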
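For the Experiment Setup row, the configuration sketch below collects the quoted settings in one place. Only the local batch sizes, the epoch-counting convention, and the "no quantization" note come from the quote; the swarm step count, momentum, weight decay, and learning-rate schedule are standard sequential-SGD defaults filled in as assumptions.

```python
# Illustrative configuration assembled from the quoted experiment setup.
# Entries marked "assumed" are not specified in the quoted text.
swarm_sgd_setup = {
    "local_batch_size": {
        "resnet20": 128,
        "resnet50": 128,
        "resnet18": 128,
    },
    "swarm_steps": 1,            # local SGD steps per model between two averaging steps (assumed value)
    "optimizer": {
        "type": "SGD",
        "momentum": 0.9,         # assumed standard value
        "weight_decay": 5e-4,    # assumed standard value
        "lr_schedule": "identical to sequential SGD for the given model",
    },
    "quantization": None,        # quantization not applied in these experiments
    "epoch_counting": "total passes over the data by all nodes",
}
```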