Faster Distributed Deep Net Training: Computation and Communication Decoupled Stochastic Gradient Descent
Authors: Shuheng Shen, Linli Xu, Jingchang Liu, Xianfeng Liang, Yifei Cheng
IJCAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on deep neural network training demonstrate the significant improvements of CoCoD-SGD: when training ResNet18 and VGG16 with 16 GeForce GTX 1080Ti GPUs, CoCoD-SGD is up to 2-3× faster than traditional synchronous SGD. In this section, we validate the performance of CoCoD-SGD in both homogeneous and heterogeneous environments. |
| Researcher Affiliation | Academia | 1) Anhui Province Key Laboratory of Big Data Analysis and Application; 2) School of Computer Science and Technology, University of Science and Technology of China; 3) School of Data Science, University of Science and Technology of China |
| Pseudocode | Yes | Algorithm 1 CoCoD-SGD (a hedged sketch of the decoupled update appears after this table) |
| Open Source Code | Yes | It can be downloaded from the anonymous link: https://github.com/IJCAI19-CoCoD-SGD/Supplemental-Material |
| Open Datasets | Yes | CIFAR10 [Krizhevsky and Hinton, 2009]: it consists of a training set of 50,000 images from 10 classes, and a test set of 10,000 images. CIFAR100 [Krizhevsky and Hinton, 2009]: it is similar to CIFAR10 but has 100 classes. |
| Dataset Splits | No | CIFAR10 [Krizhevsky and Hinton, 2009]: it consists of a training set of 50,000 images from 10 classes, and a test set of 10,000 images. |
| Hardware Specification | Yes | Hardware. We evaluate CoCoD-SGD on a cluster where each node has 3 Nvidia GeForce GTX 1080Ti GPUs, 2 Xeon(R) E5-2620 cores and 64 GB memory. The cluster has 6 nodes, which are connected with a 56 Gbps InfiniBand network. |
| Software Dependencies | Yes | Software. We use PyTorch 0.4.1 [Paszke et al., 2017] to implement the algorithms in our experiments, and use Horovod 0.15.2 [Sergeev and Balso, 2018], Open MPI 3.1.2, and NCCL 2.3.7 to conduct the GPUDirect communication with the Ring-AllReduce algorithm. (A hedged Horovod setup sketch appears after this table.) |
| Experiment Setup | Yes | Hyper-parameters. Basic batch size: 32 for both ResNet18 and VGG16. Basic learning rate: for both networks we start the learning rate from 0.01 and decay it by a factor of 10 at the beginning of the 81st epoch. Momentum: 0.9. Weight decay: 10^-4. Communication period and gradient staleness: since the variance of stochastic gradients is higher in the beginning, we set the communication period to 1 for the first 10 epochs and 5 for the subsequent epochs. The staleness of gradients in Pipe-SGD is set to 1 as suggested in [Li et al., 2018]. (A hedged PyTorch sketch of this setup appears after this table.) |
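
The Pseudocode row points to Algorithm 1 (CoCoD-SGD). Below is a minimal, hedged sketch of the core idea of decoupling computation and communication: the all-reduce of the period-start model runs asynchronously while the worker keeps taking local SGD steps, and the two are recombined at the end of the period. This is not the authors' implementation; `run_period`, `local_steps`, `data_iter`, and the recombination step are illustrative assumptions, and the sketch presumes `dist.init_process_group(...)` has already been called.

```python
# Hedged sketch of CoCoD-SGD-style computation/communication decoupling:
# the all-reduce of the period-start parameters runs asynchronously while
# this worker keeps taking local SGD steps; the two are recombined at the
# end of the period. Assumes dist.init_process_group(...) was already called.
import torch
import torch.distributed as dist
import torch.nn.functional as F

def run_period(model, optimizer, data_iter, local_steps, world_size):
    """One communication period of a decoupled local-update scheme (sketch)."""
    params = list(model.parameters())
    # Two copies of the period-start model: one is all-reduced in the
    # background, the other is kept to measure this worker's local progress.
    start = [p.detach().clone() for p in params]
    avg_buf = [s.clone() for s in start]
    handles = [dist.all_reduce(b, op=dist.ReduceOp.SUM, async_op=True)
               for b in avg_buf]

    # Local computation overlaps with the pending all-reduce above.
    for _ in range(local_steps):
        inputs, targets = next(data_iter)
        optimizer.zero_grad()
        F.cross_entropy(model(inputs), targets).backward()
        optimizer.step()

    # Block on communication, then set each parameter to the cross-worker
    # average of the period-start model plus the local progress made here.
    for h in handles:
        h.wait()
    with torch.no_grad():
        for p, s, b in zip(params, start, avg_buf):
            p.copy_(b / world_size + (p - s))
```

With the communication period reported in the Experiment Setup row, `local_steps` would be 1 for the first 10 epochs and 5 afterwards.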
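
The Software Dependencies row lists Horovod, Open MPI, and NCCL for Ring-AllReduce communication. The snippet below is a hedged sketch of how such a GPU all-reduce setup is typically wired with Horovod's PyTorch API (e.g. for a synchronous all-reduce baseline); the model choice and script structure are assumptions, not the authors' code.

```python
# Hedged sketch: Horovod over NCCL/Open MPI, one process per GPU.
# This is a generic synchronous all-reduce setup, not the authors' scripts.
import torch
import torchvision
import horovod.torch as hvd

hvd.init()                                # one process per GPU (launched via mpirun)
torch.cuda.set_device(hvd.local_rank())   # pin each process to its local GPU

model = torchvision.models.resnet18(num_classes=10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)

# Start all workers from identical state, then let Horovod ring-allreduce
# gradients (via NCCL) during the backward pass.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())
```

Such a script would typically be launched with `mpirun -np 16 python train.py` to match the 16-GPU setting reported in the paper; the script name here is hypothetical.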
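
Finally, a hedged sketch of the stated hyper-parameters together with the standard CIFAR10 train/test split from torchvision. The transforms, normalization constants, and the use of torchvision's ResNet18 are assumptions for illustration; the paper does not specify them in the quoted text.

```python
# Hedged sketch of the stated hyper-parameters on CIFAR10.
# Transforms, normalization constants, and the model variant are assumptions.
import torch
import torchvision
import torchvision.transforms as T

transform = T.Compose([T.ToTensor(),
                       T.Normalize((0.4914, 0.4822, 0.4465),    # common CIFAR10 stats
                                   (0.2470, 0.2435, 0.2616))])  # (assumption)
train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=transform)
test_set = torchvision.datasets.CIFAR10(root="./data", train=False,
                                        download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

model = torchvision.models.resnet18(num_classes=10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
# Decay the learning rate by a factor of 10 at the beginning of the 81st epoch
# (milestone 80 with 0-indexed epochs).
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80], gamma=0.1)
```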