Faster Distributed Deep Net Training: Computation and Communication Decoupled Stochastic Gradient Descent

Authors: Shuheng Shen, Linli Xu, Jingchang Liu, Xianfeng Liang, Yifei Cheng

IJCAI 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on deep neural network training demonstrate the significant improvements of CoCoD-SGD: when training ResNet18 and VGG16 with 16 GeForce GTX 1080Ti GPUs, CoCoD-SGD is up to 2-3× faster than traditional synchronous SGD. In this section, we validate the performance of CoCoD-SGD in both homogeneous and heterogeneous environments.
Researcher Affiliation | Academia | 1 Anhui Province Key Laboratory of Big Data Analysis and Application; 2 School of Computer Science and Technology, University of Science and Technology of China; 3 School of Data Science, University of Science and Technology of China
Pseudocode | Yes | Algorithm 1 CoCoD-SGD
Open Source Code | Yes | It can be downloaded from the anonymous link: https://github.com/IJCAI19-CoCoD-SGD/Supplemental-Material
Open Datasets | Yes | CIFAR10 [Krizhevsky and Hinton, 2009]: it consists of a training set of 50,000 images from 10 classes, and a test set of 10,000 images. CIFAR100 [Krizhevsky and Hinton, 2009]: it is similar to CIFAR10 but has 100 classes.
Dataset Splits | No | CIFAR10 [Krizhevsky and Hinton, 2009]: it consists of a training set of 50,000 images from 10 classes, and a test set of 10,000 images.
Hardware Specification | Yes | Hardware. We evaluate CoCoD-SGD on a cluster where each node has 3 Nvidia GeForce GTX 1080Ti GPUs, 2 Xeon(R) E5-2620 cores and 64 GB memory. The cluster has 6 nodes, which are connected with a 56Gbps InfiniBand network.
Software Dependencies | Yes | Software. We use PyTorch 0.4.1 [Paszke et al., 2017] to implement the algorithms in our experiments, and use Horovod 0.15.2 [Sergeev and Balso, 2018], Open MPI 3.1.2, and NCCL 2.3.7 to conduct the GPUDirect communication with the Ring-AllReduce algorithm.
Experiment Setup | Yes | Hyper-parameters. Basic batch size: 32 for both ResNet18 and VGG16. Basic learning rate: for both networks we start the learning rate from 0.01 and decay it by a factor of 10 at the beginning of the 81st epoch. Momentum: 0.9. Weight decay: 10^-4. Communication period and gradient staleness: since the variance of stochastic gradients is higher in the beginning, we set the communication period to 1 for the first 10 epochs and 5 for the subsequent epochs. The staleness of gradients in Pipe-SGD is set to 1 as suggested in [Li et al., 2018].
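
The Pseudocode row points to Algorithm 1 (CoCoD-SGD), which is not reproduced on this page. As rough orientation only, the sketch below illustrates the general computation/communication-overlap pattern named in the paper's title, assuming a PyTorch torch.distributed backend; it is not the authors' Algorithm 1, and names such as decoupled_round, data_iter, loss_fn, and comm_period are placeholders.

```python
# A minimal sketch of overlapping local SGD computation with a non-blocking
# parameter average. NOT the authors' Algorithm 1; names are placeholders.
import torch
import torch.distributed as dist  # dist.init_process_group(...) is assumed to have run


def decoupled_round(model, optimizer, data_iter, loss_fn, comm_period, world_size):
    """Average a parameter snapshot in the background while `comm_period`
    local SGD steps run in the foreground."""
    # Snapshot the current parameters; the all-reduce works on the copy so
    # local training can continue without a data race.
    snap = [p.detach().clone() for p in model.parameters()]
    ref = [s.clone() for s in snap]  # pre-average copy, kept to measure the local delta
    handles = [dist.all_reduce(s, async_op=True) for s in snap]  # non-blocking SUM

    # Local computation overlaps with the in-flight communication.
    for _ in range(comm_period):
        x, y = next(data_iter)
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()

    # Wait for the communication only when its result is needed, then apply
    # the averaged snapshot plus the update accumulated locally meanwhile.
    for h in handles:
        h.wait()
    with torch.no_grad():
        for p, s, r in zip(model.parameters(), snap, ref):
            p.copy_(s / world_size + (p - r))
```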
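
The Open Datasets and Dataset Splits rows quote the standard CIFAR10 train/test split (50,000/10,000 images), which is exactly what torchvision's built-in loader provides. The snippet below is a generic loading sketch under that assumption; the augmentation choices are common defaults, not values confirmed by the paper.

```python
# Generic CIFAR10 loading with the standard 50,000/10,000 split.
# Augmentation below is a common default, not the authors' pipeline.
import torchvision
import torchvision.transforms as transforms

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=train_transform)
test_set = torchvision.datasets.CIFAR10(root="./data", train=False,
                                        download=True, transform=transforms.ToTensor())
```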
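
The Software Dependencies row lists PyTorch 0.4.1 with Horovod 0.15.2, Open MPI 3.1.2, and NCCL 2.3.7 for Ring-AllReduce communication. The snippet below shows the conventional way this stack is wired together for the synchronous-SGD baseline; it is the standard Horovod usage pattern, not necessarily the authors' scripts, and build_model is a placeholder.

```python
# Conventional Horovod + PyTorch wiring for synchronous SGD over
# Open MPI / NCCL (Ring-AllReduce). `build_model` is a placeholder.
import torch
import horovod.torch as hvd

hvd.init()                                  # one process per GPU
torch.cuda.set_device(hvd.local_rank())

model = build_model().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)

# Gradients are averaged with Ring-AllReduce (NCCL) at every optimizer step.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Start every worker from the same initial weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
```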
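
The Experiment Setup row spells out the hyper-parameters. A direct translation of those values into PyTorch objects might look like the sketch below; the ImageNet-style ResNet18 constructor and the 0-indexed milestone (80 for a decay "at the beginning of the 81st epoch") are bookkeeping assumptions, not details confirmed by the paper.

```python
# Quoted hyper-parameters expressed as PyTorch objects (assumed bookkeeping).
import torch
import torchvision

model = torchvision.models.resnet18(num_classes=10)  # head swapped to CIFAR10's 10 classes

batch_size = 32  # per-worker "basic batch size"
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)

# Decay the learning rate by 10 at the beginning of the 81st epoch
# (milestone 80 with 0-indexed epochs and one scheduler.step() per epoch).
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[80], gamma=0.1)


def communication_period(epoch):
    """Period 1 for the first 10 epochs, 5 afterwards (epochs counted from 1)."""
    return 1 if epoch <= 10 else 5
```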