SlowMo: Improving Communication-Efficient Distributed SGD with Slow Momentum
Authors: Jianyu Wang, Vinayak Tantia, Nicolas Ballas, Michael Rabbat
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on image classification and machine translation tasks demonstrate that SLOWMO consistently yields improvements in optimization and generalization performance relative to the base optimizer, even when the additional overhead is amortized over many updates so that the SLOWMO runtime is on par with that of the base optimizer. We provide theoretical convergence guarantees showing that SLOWMO converges to a stationary point of smooth non-convex losses. |
| Researcher Affiliation | Collaboration | Jianyu Wang (Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA; jianyuw1@andrew.cmu.edu); Vinayak Tantia, Nicolas Ballas & Michael Rabbat (Facebook AI Research, Montreal, Canada; {tantia, ballasn, mikerabbat}@fb.com) |
| Pseudocode | Yes | Algorithm 1: Slow Momentum (a hedged sketch of the slow-momentum outer update appears below the table) |
| Open Source Code | No | The paper does not provide a link to, or an explicit statement about, open-source code for its own method. It only mentions using existing implementations of SGP and OSGP. |
| Open Datasets | Yes | On CIFAR-10 (Krizhevsky et al., 2009), we train a ResNet-18... On ImageNet (Krizhevsky et al., 2012), we train a ResNet-50... On WMT'16 En-De, we train a transformer model (Vaswani et al., 2017). |
| Dataset Splits | No | The paper mentions 'validation accuracy' and 'validation NLL' but does not detail the split percentages or sizes for the training, validation, and test sets. It implies standard splits for these widely used datasets without stating them explicitly. |
| Hardware Specification | Yes | All experiments use NVIDIA DGX-1 servers as worker nodes. Each server contains 8 NVIDIA V100 GPUs and the servers are internetworked via commodity 10 Gbps Ethernet. |
| Software Dependencies | Yes | All methods are implemented in PyTorch 1.0 (Paszke et al., 2017), and our experiments use CUDA 9.2, CUDNN 7.3, and NCCL 2.2.13. |
| Experiment Setup | Yes | The total mini-batch size is 4096, and we train for 200 epochs. The learning rate (γt) linearly increases during the first 5 epochs, following the warm-up strategy in Goyal et al. (2017), and then decays by a factor of 10 at epochs 100, 150, and 175. (A sketch of this schedule follows the table.) |
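
The pseudocode row above refers to Algorithm 1 (Slow Momentum): workers run a base optimizer for τ inner steps, their parameters are exactly averaged, and an outer "slow" momentum step is applied to the averaged iterate. The following is a minimal, single-process sketch of that outer update, not the authors' released implementation. Worker replicas are simulated as a Python list instead of distributed processes, the base optimizer is plain SGD with momentum, and the names `slowmo_train`, `data_iter`, `loss_fn`, and all hyperparameter values are illustrative assumptions.

```python
# Hedged sketch of the SlowMo outer loop (Algorithm 1), assuming a constant
# inner learning rate gamma_t = base_lr. Not the paper's implementation.
import copy
import torch

def slowmo_train(model, data_iter, loss_fn, num_workers=4, tau=12,
                 base_lr=0.1, alpha=1.0, beta=0.7, outer_steps=100):
    # Replicate the model once per simulated worker (all start identical).
    workers = [copy.deepcopy(model) for _ in range(num_workers)]
    # Slow momentum buffer u, one tensor per parameter, initialized to zero.
    slow_buf = [torch.zeros_like(p) for p in model.parameters()]

    for t in range(outer_steps):
        # Snapshot x_{t,0}: synchronized parameters at the start of the round.
        x0 = [p.detach().clone() for p in model.parameters()]

        # Inner loop: each worker runs tau base-optimizer steps on its own
        # mini-batches (here: SGD with momentum as a stand-in base optimizer).
        for w in workers:
            opt = torch.optim.SGD(w.parameters(), lr=base_lr, momentum=0.9)
            for _ in range(tau):
                inputs, targets = next(data_iter)
                opt.zero_grad()
                loss_fn(w(inputs), targets).backward()
                opt.step()

        # Exact average across workers (stands in for an all-reduce).
        avg = [torch.stack([list(w.parameters())[i].detach()
                            for w in workers]).mean(dim=0)
               for i in range(len(x0))]

        # Slow momentum update:
        #   u_{t+1}   = beta * u_t + (x_{t,0} - x_bar_{t,tau}) / gamma_t
        #   x_{t+1,0} = x_{t,0} - alpha * gamma_t * u_{t+1}
        with torch.no_grad():
            for p, x, xbar, u in zip(model.parameters(), x0, avg, slow_buf):
                u.mul_(beta).add_((x - xbar) / base_lr)
                p.copy_(x - alpha * base_lr * u)

        # Re-synchronize every worker to the new outer iterate x_{t+1,0}.
        for w in workers:
            w.load_state_dict(model.state_dict())

    return model
```

With beta = 0 and alpha = 1 the outer step reduces to plain parameter averaging every tau steps (local SGD), which is how the paper frames existing methods as special cases of the framework.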
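The experiment-setup row quotes a linear warm-up over the first 5 epochs (Goyal et al., 2017) followed by 10x decays at epochs 100, 150, and 175. The helper below sketches only the shape of that schedule; the function name `lr_at_epoch` and the `peak_lr` value are assumptions, since the report does not restate the peak learning rate.

```python
# Hedged sketch of the quoted learning-rate schedule; peak_lr is illustrative.
def lr_at_epoch(epoch, peak_lr=0.1, warmup_epochs=5, decay_epochs=(100, 150, 175)):
    if epoch < warmup_epochs:
        # Linear ramp up to peak_lr over the warm-up epochs.
        return peak_lr * (epoch + 1) / warmup_epochs
    lr = peak_lr
    for boundary in decay_epochs:
        if epoch >= boundary:
            lr *= 0.1  # decay by a factor of 10 at each boundary
    return lr
```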