Faster Distributed Deep Net Training: Computation and Communication Decoupled Stochastic Gradient Descent
Authors: Shuheng Shen, Linli Xu, Jingchang Liu, Xianfeng Liang, Yifei Cheng
IJCAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on deep neural network training demonstrate the significant improvements of CoCoD-SGD: when training ResNet18 and VGG16 with 16 GeForce GTX 1080Ti GPUs, CoCoD-SGD is up to 2-3× faster than traditional synchronous SGD. In this section, we validate the performance of CoCoD-SGD in both homogeneous and heterogeneous environments. |
| Researcher Affiliation | Academia | 1) Anhui Province Key Laboratory of Big Data Analysis and Application; 2) School of Computer Science and Technology, University of Science and Technology of China; 3) School of Data Science, University of Science and Technology of China |
| Pseudocode | Yes | Algorithm 1 CoCoD-SGD (a hedged sketch of the decoupled update appears after this table) |
| Open Source Code | Yes | It can be downloaded from the anonymous link: https://github.com/IJCAI19-CoCoD-SGD/Supplemental-Material |
| Open Datasets | Yes | CIFAR10 [Krizhevsky and Hinton, 2009]: it consists of a training set of 50,000 images from 10 classes, and a test set of 10,000 images. CIFAR100 [Krizhevsky and Hinton, 2009]: it is similar to CIFAR10 but has 100 classes. |
| Dataset Splits | No | CIFAR10 [Krizhevsky and Hinton, 2009]: it consists of a training set of 50,000 images from 10 classes, and a test set of 10,000 images. |
| Hardware Specification | Yes | Hardware. We evaluate CoCoD-SGD on a cluster where each node has 3 Nvidia GeForce GTX 1080Ti GPUs, 2 Xeon(R) E5-2620 cores and 64 GB memory. The cluster has 6 nodes, which are connected with a 56 Gbps InfiniBand network. |
| Software Dependencies | Yes | Software. We use PyTorch 0.4.1 [Paszke et al., 2017] to implement the algorithms in our experiments, and use Horovod 0.15.2 [Sergeev and Balso, 2018], Open MPI 3.1.2, and NCCL 2.3.7 to conduct the GPUDirect communication with the Ring-AllReduce algorithm. (A hedged Horovod setup sketch appears after this table.) |
| Experiment Setup | Yes | Hyper-parameters. Basic batch size: 32 for both ResNet18 and VGG16. Basic learning rate: for both networks we start the learning rate from 0.01 and decay it by a factor of 10 at the beginning of the 81st epoch. Momentum: 0.9. Weight decay: 10^-4. Communication period and gradient staleness: since the variance of stochastic gradients is higher in the beginning, we set the communication period to 1 for the first 10 epochs and 5 for the subsequent epochs. The staleness of gradients in Pipe-SGD is set to 1 as suggested in [Li et al., 2018]. (A hedged PyTorch sketch of this setup appears after this table.) |
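
The Pseudocode row points to Algorithm 1 (CoCoD-SGD). Below is a minimal, hedged sketch of the core idea of decoupling computation and communication: the all-reduce of the period-start model runs asynchronously while the worker keeps taking local SGD steps, and the two are recombined at the end of the period. This is not the authors' implementation; `run_period`, `local_steps`, `data_iter`, and the recombination step are illustrative assumptions, and the sketch presumes `dist.init_process_group(...)` has already been called.

```python
# Hedged sketch of CoCoD-SGD-style computation/communication decoupling:
# the all-reduce of the period-start parameters runs asynchronously while
# this worker keeps taking local SGD steps; the two are recombined at the
# end of the period. Assumes dist.init_process_group(...) was already called.
import torch
import torch.distributed as dist
import torch.nn.functional as F

def run_period(model, optimizer, data_iter, local_steps, world_size):
    """One communication period of a decoupled local-update scheme (sketch)."""
    params = list(model.parameters())
    # Two copies of the period-start model: one is all-reduced in the
    # background, the other is kept to measure this worker's local progress.
    start = [p.detach().clone() for p in params]
    avg_buf = [s.clone() for s in start]
    handles = [dist.all_reduce(b, op=dist.ReduceOp.SUM, async_op=True)
               for b in avg_buf]

    # Local computation overlaps with the pending all-reduce above.
    for _ in range(local_steps):
        inputs, targets = next(data_iter)
        optimizer.zero_grad()
        F.cross_entropy(model(inputs), targets).backward()
        optimizer.step()

    # Block on communication, then set each parameter to the cross-worker
    # average of the period-start model plus the local progress made here.
    for h in handles:
        h.wait()
    with torch.no_grad():
        for p, s, b in zip(params, start, avg_buf):
            p.copy_(b / world_size + (p - s))
```

With the communication period reported in the Experiment Setup row, `local_steps` would be 1 for the first 10 epochs and 5 afterwards.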
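
The Software Dependencies row lists Horovod, Open MPI, and NCCL for Ring-AllReduce communication. The snippet below is a hedged sketch of how such a GPU all-reduce setup is typically wired with Horovod's PyTorch API (e.g. for a synchronous all-reduce baseline); the model choice and script structure are assumptions, not the authors' code.

```python
# Hedged sketch: Horovod over NCCL/Open MPI, one process per GPU.
# This is a generic synchronous all-reduce setup, not the authors' scripts.
import torch
import torchvision
import horovod.torch as hvd

hvd.init()                                # one process per GPU (launched via mpirun)
torch.cuda.set_device(hvd.local_rank())   # pin each process to its local GPU

model = torchvision.models.resnet18(num_classes=10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)

# Start all workers from identical state, then let Horovod ring-allreduce
# gradients (via NCCL) during the backward pass.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())
```

Such a script would typically be launched with `mpirun -np 16 python train.py` to match the 16-GPU setting reported in the paper; the script name here is hypothetical.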
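
Finally, a hedged sketch of the stated hyper-parameters together with the standard CIFAR10 train/test split from torchvision. The transforms, normalization constants, and the use of torchvision's ResNet18 are assumptions for illustration; the paper does not specify them in the quoted text.

```python
# Hedged sketch of the stated hyper-parameters on CIFAR10.
# Transforms, normalization constants, and the model variant are assumptions.
import torch
import torchvision
import torchvision.transforms as T

transform = T.Compose([T.ToTensor(),
                       T.Normalize((0.4914, 0.4822, 0.4465),    # common CIFAR10 stats
                                   (0.2470, 0.2435, 0.2616))])  # (assumption)
train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=transform)
test_set = torchvision.datasets.CIFAR10(root="./data", train=False,
                                        download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

model = torchvision.models.resnet18(num_classes=10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
# Decay the learning rate by a factor of 10 at the beginning of the 81st epoch
# (milestone 80 with 0-indexed epochs).
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80], gamma=0.1)
```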