Asynchronous Stochastic Gradient Descent with Delay Compensation
Authors: Shuxin Zheng, Qi Meng, Taifeng Wang, Wei Chen, Nenghai Yu, Zhi-Ming Ma, Tie-Yan Liu
ICML 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluated the proposed algorithm on CIFAR-10 and ImageNet datasets, and the experimental results demonstrate that DC-ASGD outperforms both synchronous SGD and asynchronous SGD, and nearly approaches the performance of sequential SGD. |
| Researcher Affiliation | Collaboration | University of Science and Technology of China; School of Mathematical Sciences, Peking University; Microsoft Research; Academy of Mathematics and Systems Science, Chinese Academy of Sciences. |
| Pseudocode | Yes | Algorithm 1 DC-ASGD: worker m (a sketch of the delay-compensated update rule is given after the table). |
| Open Source Code | No | For the DNN algorithm running on each worker, we chose ResNet (He et al., 2016) since it produces the state-of-the-art accuracy in many image-related tasks and its implementation is available through open-source projects. For the parallelization of ResNet across machines, we leveraged an open-source parameter server. (Footnote 8: https://github.com/KaimingHe/deep-residual-networks, Footnote 9: http://www.dmtk.io/) |
| Open Datasets | Yes | We used two datasets: CIFAR-10 (Hinton, 2007) and ImageNet ILSVRC 2013 (Russakovsky et al., 2015). |
| Dataset Splits | Yes | The CIFAR-10 dataset consists of a training set of 50k images and a test set of 10k images in 10 classes. |
| Hardware Specification | Yes | The experiments were conducted on a GPU cluster interconnected with InfiniBand. Each node has four K40 Tesla GPU processors. |
| Software Dependencies | No | The paper mentions using ResNet and an open-source parameter server (dmtk.io) but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | For all the algorithms under investigation, we performed training for 160 epochs, with a mini-batch size of 128, and an initial learning rate that was reduced by a factor of ten after 80 and 120 epochs, following the practice in (He et al., 2016). We performed grid search for the hyper-parameters, and the best test performances were obtained by choosing the initial learning rate η = 0.5, λ0 = 0.04 for DC-ASGD-c, and λ0 = 2, m = 0.95 for DC-ASGD-a. (An illustrative sketch of this learning-rate schedule also follows the table.) |
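
As referenced in the Pseudocode row, the paper presents DC-ASGD as pseudocode (Algorithm 1 for worker m, with a corresponding parameter-server update). Below is a minimal NumPy sketch of the server-side delay-compensated update, w ← w − η (g + λ g ⊙ g ⊙ (w − w_bak)), in its constant-λ form (DC-ASGD-c). The function name, variable names, and toy usage are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def dc_asgd_update(w, w_bak, grad, lr=0.5, lam=0.04):
    """One delay-compensated update on the parameter server (DC-ASGD-c sketch).

    w      -- current global parameters (possibly changed by other workers
              since this worker pulled its copy)
    w_bak  -- backup of the parameters this worker's gradient was computed on
    grad   -- stochastic gradient evaluated at w_bak
    lr     -- learning rate (eta)
    lam    -- delay-compensation coefficient (lambda_0 in the paper)
    """
    # First-order term plus the cheap diagonal Hessian approximation:
    # grad * grad * (w - w_bak) compensates for the staleness of grad.
    compensated = grad + lam * grad * grad * (w - w_bak)
    return w - lr * compensated

# Toy usage: a "stale" worker whose gradient was computed on older parameters.
rng = np.random.default_rng(0)
w = rng.normal(size=5)        # current global parameters
w_bak = w - 0.1               # parameters the worker actually used
grad = rng.normal(size=5)     # gradient evaluated at w_bak
print(dc_asgd_update(w, w_bak, grad))
```

The adaptive variant (DC-ASGD-a, with the reported m = 0.95) additionally rescales λ over time; that scheme is not reproduced here.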
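
The Experiment Setup row describes the learning-rate schedule only in prose (160 epochs, with the initial rate cut by a factor of ten after epochs 80 and 120). A small illustrative sketch of that step schedule, using the reported η = 0.5 as the default, is shown below; the function name and structure are assumptions, not the authors' training script.

```python
def step_lr(epoch: int, base_lr: float = 0.5) -> float:
    """Step schedule from the setup row: drop the rate 10x at epochs 80 and 120."""
    if epoch < 80:
        return base_lr
    if epoch < 120:
        return base_lr / 10.0
    return base_lr / 100.0

# Spot-check the three plateaus over the reported 160-epoch run.
print([step_lr(e) for e in (0, 79, 80, 119, 120, 159)])
# -> [0.5, 0.5, 0.05, 0.05, 0.005, 0.005]
```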