SlowMo: Improving Communication-Efficient Distributed SGD with Slow Momentum

Authors: Jianyu Wang, Vinayak Tantia, Nicolas Ballas, Michael Rabbat

ICLR 2020

Reproducibility variables, each listed with its result and the supporting LLM response:
Research Type: Experimental
  Experiments on image classification and machine translation tasks demonstrate that SLOWMO consistently yields improvements in optimization and generalization performance relative to the base optimizer, even when the additional overhead is amortized over many updates so that the SLOWMO runtime is on par with that of the base optimizer. We provide theoretical convergence guarantees showing that SLOWMO converges to a stationary point of smooth non-convex losses.
Researcher Affiliation: Collaboration
  Jianyu Wang, Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA (jianyuw1@andrew.cmu.edu); Vinayak Tantia, Nicolas Ballas & Michael Rabbat, Facebook AI Research, Montreal, Canada ({tantia, ballasn, mikerabbat}@fb.com)
Pseudocode: Yes
  Algorithm 1: Slow Momentum (a hedged sketch of the slow momentum update appears after this listing).
Open Source Code: No
  The paper does not provide a direct link or an explicit statement about open-sourcing the code for its own method; it only mentions using existing implementations of SGP and OSGP.
Open Datasets: Yes
  On CIFAR-10 (Krizhevsky et al., 2009), we train a ResNet-18... On ImageNet (Krizhevsky et al., 2012), we train a ResNet-50... On WMT'16 En-De, we train a transformer model (Vaswani et al., 2017).
Dataset Splits: No
  The paper mentions 'validation accuracy' and 'validation NLL' but does not explicitly state the sizes or percentages of the training, validation, and test splits; it implies standard splits for these widely used datasets without stating them.
Hardware Specification: Yes
  All experiments use NVIDIA DGX-1 servers as worker nodes. Each server contains 8 NVIDIA V100 GPUs and the servers are internetworked via commodity 10 Gbps Ethernet.
Software Dependencies: Yes
  All methods are implemented in PyTorch 1.0 (Paszke et al., 2017), and our experiments use CUDA 9.2, CUDNN 7.3, and NCCL 2.2.13.
Experiment Setup: Yes
  The total mini-batch size is 4096, and we train for 200 epochs. The learning rate (γ_t) linearly increases during the first 5 epochs, following the warm-up strategy in Goyal et al. (2017), and then decays by a factor of 10 at epochs 100, 150, and 175. (A sketch of this schedule appears after this listing.)
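
To make the Pseudocode entry concrete, the following is a minimal single-process sketch of the slow momentum ("outer") update described by Algorithm 1 of the paper: run several inner steps of a base optimizer, exact-average the parameters across workers, then apply a momentum step on the averaged parameters. The helper names, the toy quadratic loss, and the single-worker no-op average are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of the slow momentum outer step (Algorithm 1), single process.
import torch

def inner_sgd_steps(x, steps, lr):
    """Stand-in for tau steps of the base optimizer on one worker (toy loss ||x||^2)."""
    for _ in range(steps):
        grad = 2.0 * x          # gradient of the toy quadratic loss
        x = x - lr * grad
    return x

def average_across_workers(x):
    """Stand-in for the exact parameter averaging; a no-op with a single worker."""
    return x

def slowmo_outer_step(x, u, tau, gamma, alpha=1.0, beta=0.5):
    """One outer iteration: tau inner steps, average, then the slow momentum update.

    x: slow weights x_t; u: slow momentum buffer u_t; gamma: base learning rate;
    alpha: slow learning rate; beta: slow momentum coefficient (illustrative value).
    """
    x_local = inner_sgd_steps(x, steps=tau, lr=gamma)   # x^i_{t,tau}
    x_avg = average_across_workers(x_local)             # exact average x_bar_{t,tau}
    u = beta * u + (x - x_avg) / gamma                   # u_{t+1}
    x = x - alpha * gamma * u                            # x_{t+1}
    return x, u

# Tiny usage example on a 3-dimensional toy problem.
x = torch.tensor([1.0, -2.0, 3.0])
u = torch.zeros_like(x)
for _ in range(20):
    x, u = slowmo_outer_step(x, u, tau=5, gamma=0.1)
print(x)  # drifts toward the minimizer (zero) of the toy loss
```

In a multi-worker run, the stand-ins would be replaced by the actual base optimizer (e.g., local SGD or SGP) and an all-reduce average of parameters; the outer update itself is unchanged.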
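
The Experiment Setup entry describes a warm-up-then-step-decay schedule for CIFAR-10. Below is a minimal sketch of that schedule, assuming a linear ramp from peak_lr / warmup_epochs up to a peak rate over 5 epochs and 10x decays at epochs 100, 150, and 175; the peak value and the exact warm-up starting point are illustrative assumptions, not values stated in the excerpt.

```python
# Hedged sketch of the quoted learning-rate schedule (warm-up + step decay).
def lr_at_epoch(epoch, peak_lr, warmup_epochs=5, milestones=(100, 150, 175)):
    if epoch < warmup_epochs:
        # linear warm-up over the first `warmup_epochs` epochs
        return peak_lr * (epoch + 1) / warmup_epochs
    num_decays = sum(epoch >= m for m in milestones)  # milestones already passed
    return peak_lr * (0.1 ** num_decays)

# Example with an illustrative peak_lr of 0.4: epochs 0-4 warm up, epoch 100
# drops to 0.04, epoch 150 to 0.004, and epoch 175 to 0.0004.
for e in (0, 4, 50, 100, 150, 175, 199):
    print(e, lr_at_epoch(e, peak_lr=0.4))
```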