Stochastic Gradient Push for Distributed Deep Learning
Authors: Mahmoud Assran, Nicolas Loizou, Nicolas Ballas, Mike Rabbat
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate the performance of SGP on image classification (ResNet-50, ImageNet) and machine translation (Transformer, WMT'16 En-De) workloads. |
| Researcher Affiliation | Collaboration | (1) Facebook AI Research, Montréal, QC, Canada; (2) Department of Electrical and Computer Engineering, McGill University, Montréal, QC, Canada; (3) School of Mathematics, University of Edinburgh, Edinburgh, Scotland. |
| Pseudocode | Yes | Pseudocode is shown in Alg. 1. |
| Open Source Code | Yes | Our code is available at https://github.com/facebookresearch/stochastic_gradient_push. |
| Open Datasets | Yes | We train a ResNet-50 (He et al., 2016) on the ImageNet classification task (Russakovsky et al., 2015). We train a transformer network (Vaswani et al., 2017) on WMT'16 En-De. |
| Dataset Splits | Yes | We train a ResNet-50 (He et al., 2016) on the ImageNet classification task (Russakovsky et al., 2015). We train a transformer network (Vaswani et al., 2017) on WMT'16 En-De. The paper mentions 'validation accuracy' and 'validation curves', implying the use of standard benchmark dataset splits. |
| Hardware Specification | Yes | Our experiments use 32 NVIDIA DGX-1 servers. Each server has 8 V100 GPUs. |
| Software Dependencies | No | The paper mentions that 'All algorithms are implemented in PyTorch (Paszke et al., 2017),' but does not provide a specific version number for PyTorch or any other software component used. |
| Experiment Setup | Yes | Every node uses a mini-batch size of 256, so using more nodes corresponds to a larger effective mini-batch size. Unless indicated otherwise, all experiments are run for 90 epochs, the learning rate warms up to n · 0.1 during the first five epochs following Goyal et al. (2017) and is decayed by a factor of 10 at epochs 30, 60, and 80. All methods use Nesterov momentum. A hedged sketch of this schedule appears below the table. |
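
The Pseudocode row above points to Alg. 1, the Stochastic Gradient Push update. Below is a minimal single-process sketch of that update written from the paper's description: each simulated node takes a local SGD step on its de-biased parameters, then mixes its push-sum numerator and weight with a column-stochastic matrix. The quadratic objectives, directed-ring topology, step size, and noise level are illustrative assumptions, not the paper's workloads.

```python
# Minimal single-process simulation of Stochastic Gradient Push (Alg. 1).
# The objective, topology, and hyperparameters below are illustrative
# assumptions, not the paper's ImageNet/WMT setup.
import numpy as np

def sgp_simulation(n_nodes=8, dim=10, steps=200, lr=0.05, seed=0):
    rng = np.random.default_rng(seed)

    # Each node i holds a private quadratic f_i(z) = 0.5 * ||z - t_i||^2;
    # the global optimum is the mean of the targets t_i.
    targets = rng.normal(size=(n_nodes, dim))
    optimum = targets.mean(axis=0)

    # Column-stochastic mixing matrix P for a directed ring:
    # node i keeps half its mass and pushes half to node (i + 1) % n.
    P = np.zeros((n_nodes, n_nodes))
    for i in range(n_nodes):
        P[i, i] = 0.5
        P[(i + 1) % n_nodes, i] = 0.5

    x = np.zeros((n_nodes, dim))   # push-sum numerators
    w = np.ones(n_nodes)           # push-sum weights (denominators)
    z = x / w[:, None]             # de-biased parameter estimates

    for _ in range(steps):
        # 1) Local stochastic gradient step, evaluated at the de-biased z_i.
        noise = 0.1 * rng.normal(size=(n_nodes, dim))  # stand-in for mini-batch noise
        grad = (z - targets) + noise
        x = x - lr * grad

        # 2) Push-sum gossip: mix numerators and weights with column-stochastic P.
        x = P @ x
        w = P @ w

        # 3) De-bias to recover each node's parameter estimate.
        z = x / w[:, None]

    return np.linalg.norm(z - optimum, axis=1)  # per-node distance to the optimum

if __name__ == "__main__":
    print(sgp_simulation())
```

In the distributed setting each row of this update runs on a separate worker and the multiplication by P becomes asynchronous point-to-point sends to out-neighbors; the single-matrix form here is only meant to show the numerator/weight bookkeeping.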
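The Experiment Setup row quotes a learning-rate schedule: linear warmup to n · 0.1 over the first five epochs, division by 10 at epochs 30, 60, and 80, with Nesterov momentum and 90 total epochs. The PyTorch sketch below illustrates that schedule; the warmup starting point of 0.1, the momentum value of 0.9, the choice of n_nodes = 32, and the stand-in model are assumptions made for illustration (following Goyal et al., 2017), not details quoted from the paper.

```python
# Hedged sketch of the quoted schedule: warmup to n * 0.1 over 5 epochs,
# decay by 10x at epochs 30, 60, 80, Nesterov momentum, 90 epochs total.
# base_lr=0.1, momentum=0.9, and n_nodes=32 are assumed values.
import torch

def lr_at_epoch(epoch, n_nodes, base_lr=0.1, warmup_epochs=5):
    peak_lr = n_nodes * base_lr
    if epoch < warmup_epochs:
        # Linear warmup from the single-node reference rate to the scaled rate.
        return base_lr + (peak_lr - base_lr) * epoch / warmup_epochs
    lr = peak_lr
    for milestone in (30, 60, 80):
        if epoch >= milestone:
            lr /= 10.0
    return lr

model = torch.nn.Linear(10, 10)  # stand-in for ResNet-50
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, nesterov=True)

for epoch in range(90):
    lr = lr_at_epoch(epoch, n_nodes=32)
    for group in optimizer.param_groups:
        group["lr"] = lr
    # ... one epoch of training with a per-node mini-batch size of 256 ...
```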