SlowMo: Improving Communication-Efficient Distributed SGD with Slow Momentum

Authors: Jianyu Wang, Vinayak Tantia, Nicolas Ballas, Michael Rabbat

ICLR 2020

Reproducibility variables, each listed with its result and the supporting LLM response:
Research Type: Experimental
  Experiments on image classification and machine translation tasks demonstrate that SLOWMO consistently yields improvements in optimization and generalization performance relative to the base optimizer, even when the additional overhead is amortized over many updates so that the SLOWMO runtime is on par with that of the base optimizer. We provide theoretical convergence guarantees showing that SLOWMO converges to a stationary point of smooth non-convex losses.
Researcher Affiliation: Collaboration
  Jianyu Wang, Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA (jianyuw1@andrew.cmu.edu); Vinayak Tantia, Nicolas Ballas & Michael Rabbat, Facebook AI Research, Montreal, Canada ({tantia, ballasn, mikerabbat}@fb.com)
Pseudocode: Yes
  Algorithm 1: Slow Momentum (a hedged sketch of the slow momentum update appears after this listing).
Open Source Code: No
  The paper does not provide a direct link or an explicit statement about open-sourcing the code for its own method; it only mentions using existing implementations of SGP and OSGP.
Open Datasets: Yes
  On CIFAR-10 (Krizhevsky et al., 2009), we train a ResNet-18... On ImageNet (Krizhevsky et al., 2012), we train a ResNet-50... On WMT'16 En-De, we train a transformer model (Vaswani et al., 2017).
Dataset Splits: No
  The paper mentions 'validation accuracy' and 'validation NLL' but does not explicitly state the sizes or percentages of the training, validation, and test splits; it implies standard splits for these widely used datasets without stating them.
Hardware Specification: Yes
  All experiments use NVIDIA DGX-1 servers as worker nodes. Each server contains 8 NVIDIA V100 GPUs and the servers are internetworked via commodity 10 Gbps Ethernet.
Software Dependencies: Yes
  All methods are implemented in PyTorch 1.0 (Paszke et al., 2017), and our experiments use CUDA 9.2, CUDNN 7.3, and NCCL 2.2.13.
Experiment Setup: Yes
  The total mini-batch size is 4096, and we train for 200 epochs. The learning rate (γ_t) linearly increases during the first 5 epochs, following the warm-up strategy in Goyal et al. (2017), and then decays by a factor of 10 at epochs 100, 150, and 175. (A sketch of this schedule appears after this listing.)
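
To make the Pseudocode entry concrete, the following is a minimal single-process sketch of the slow momentum ("outer") update described by Algorithm 1 of the paper: run several inner steps of a base optimizer, exact-average the parameters across workers, then apply a momentum step on the averaged parameters. The helper names, the toy quadratic loss, and the single-worker no-op average are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of the slow momentum outer step (Algorithm 1), single process.
import torch

def inner_sgd_steps(x, steps, lr):
    """Stand-in for tau steps of the base optimizer on one worker (toy loss ||x||^2)."""
    for _ in range(steps):
        grad = 2.0 * x          # gradient of the toy quadratic loss
        x = x - lr * grad
    return x

def average_across_workers(x):
    """Stand-in for the exact parameter averaging; a no-op with a single worker."""
    return x

def slowmo_outer_step(x, u, tau, gamma, alpha=1.0, beta=0.5):
    """One outer iteration: tau inner steps, average, then the slow momentum update.

    x: slow weights x_t; u: slow momentum buffer u_t; gamma: base learning rate;
    alpha: slow learning rate; beta: slow momentum coefficient (illustrative value).
    """
    x_local = inner_sgd_steps(x, steps=tau, lr=gamma)   # x^i_{t,tau}
    x_avg = average_across_workers(x_local)             # exact average x_bar_{t,tau}
    u = beta * u + (x - x_avg) / gamma                   # u_{t+1}
    x = x - alpha * gamma * u                            # x_{t+1}
    return x, u

# Tiny usage example on a 3-dimensional toy problem.
x = torch.tensor([1.0, -2.0, 3.0])
u = torch.zeros_like(x)
for _ in range(20):
    x, u = slowmo_outer_step(x, u, tau=5, gamma=0.1)
print(x)  # drifts toward the minimizer (zero) of the toy loss
```

In a multi-worker run, the stand-ins would be replaced by the actual base optimizer (e.g., local SGD or SGP) and an all-reduce average of parameters; the outer update itself is unchanged.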
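
The Experiment Setup entry describes a warm-up-then-step-decay schedule for CIFAR-10. Below is a minimal sketch of that schedule, assuming a linear ramp from peak_lr / warmup_epochs up to a peak rate over 5 epochs and 10x decays at epochs 100, 150, and 175; the peak value and the exact warm-up starting point are illustrative assumptions, not values stated in the excerpt.

```python
# Hedged sketch of the quoted learning-rate schedule (warm-up + step decay).
def lr_at_epoch(epoch, peak_lr, warmup_epochs=5, milestones=(100, 150, 175)):
    if epoch < warmup_epochs:
        # linear warm-up over the first `warmup_epochs` epochs
        return peak_lr * (epoch + 1) / warmup_epochs
    num_decays = sum(epoch >= m for m in milestones)  # milestones already passed
    return peak_lr * (0.1 ** num_decays)

# Example with an illustrative peak_lr of 0.4: epochs 0-4 warm up, epoch 100
# drops to 0.04, epoch 150 to 0.004, and epoch 175 to 0.0004.
for e in (0, 4, 50, 100, 150, 175, 199):
    print(e, lr_at_epoch(e, peak_lr=0.4))
```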