Asynchronous Stochastic Gradient Descent with Delay Compensation
Authors: Shuxin Zheng, Qi Meng, Taifeng Wang, Wei Chen, Nenghai Yu, Zhi-Ming Ma, Tie-Yan Liu
ICML 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluated the proposed algorithm on CIFAR-10 and ImageNet datasets, and the experimental results demonstrate that DC-ASGD outperforms both synchronous SGD and asynchronous SGD, and nearly approaches the performance of sequential SGD. |
| Researcher Affiliation | Collaboration | University of Science and Technology of China; School of Mathematical Sciences, Peking University; Microsoft Research; Academy of Mathematics and Systems Science, Chinese Academy of Sciences. |
| Pseudocode | Yes | Algorithm 1 DC-ASGD: worker m (a sketch of the delay-compensated update rule is given after the table). |
| Open Source Code | No | For the DNN algorithm running on each worker, we chose ResNet (He et al., 2016) since it produces the state-of-the-art accuracy in many image-related tasks and its implementation is available through open-source projects. For the parallelization of ResNet across machines, we leveraged an open-source parameter server. (Footnote 8: https://github.com/KaimingHe/deep-residual-networks, Footnote 9: http://www.dmtk.io/) |
| Open Datasets | Yes | We used two datasets: CIFAR-10 (Hinton, 2007) and ImageNet ILSVRC 2013 (Russakovsky et al., 2015). |
| Dataset Splits | Yes | The CIFAR-10 dataset consists of a training set of 50k images and a test set of 10k images in 10 classes. |
| Hardware Specification | Yes | The experiments were conducted on a GPU cluster interconnected with InfiniBand. Each node has four K40 Tesla GPU processors. |
| Software Dependencies | No | The paper mentions using ResNet and an open-source parameter server (dmtk.io) but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | For all the algorithms under investigation, we performed training for 160 epochs, with a mini-batch size of 128, and an initial learning rate that was reduced by a factor of ten after 80 and 120 epochs, following the practice in (He et al., 2016). We performed grid search for the hyper-parameters, and the best test performances were obtained by choosing the initial learning rate η = 0.5, λ0 = 0.04 for DC-ASGD-c, and λ0 = 2, m = 0.95 for DC-ASGD-a. (An illustrative sketch of this learning-rate schedule also follows the table.) |
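
As referenced in the Pseudocode row, the paper presents DC-ASGD as pseudocode (Algorithm 1 for worker m, with a corresponding parameter-server update). Below is a minimal NumPy sketch of the server-side delay-compensated update, w ← w − η (g + λ g ⊙ g ⊙ (w − w_bak)), in its constant-λ form (DC-ASGD-c). The function name, variable names, and toy usage are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def dc_asgd_update(w, w_bak, grad, lr=0.5, lam=0.04):
    """One delay-compensated update on the parameter server (DC-ASGD-c sketch).

    w      -- current global parameters (possibly changed by other workers
              since this worker pulled its copy)
    w_bak  -- backup of the parameters this worker's gradient was computed on
    grad   -- stochastic gradient evaluated at w_bak
    lr     -- learning rate (eta)
    lam    -- delay-compensation coefficient (lambda_0 in the paper)
    """
    # First-order term plus the cheap diagonal Hessian approximation:
    # grad * grad * (w - w_bak) compensates for the staleness of grad.
    compensated = grad + lam * grad * grad * (w - w_bak)
    return w - lr * compensated

# Toy usage: a "stale" worker whose gradient was computed on older parameters.
rng = np.random.default_rng(0)
w = rng.normal(size=5)        # current global parameters
w_bak = w - 0.1               # parameters the worker actually used
grad = rng.normal(size=5)     # gradient evaluated at w_bak
print(dc_asgd_update(w, w_bak, grad))
```

The adaptive variant (DC-ASGD-a, with the reported m = 0.95) additionally rescales λ over time; that scheme is not reproduced here.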
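
The Experiment Setup row describes the learning-rate schedule only in prose (160 epochs, with the initial rate cut by a factor of ten after epochs 80 and 120). A small illustrative sketch of that step schedule, using the reported η = 0.5 as the default, is shown below; the function name and structure are assumptions, not the authors' training script.

```python
def step_lr(epoch: int, base_lr: float = 0.5) -> float:
    """Step schedule from the setup row: drop the rate 10x at epochs 80 and 120."""
    if epoch < 80:
        return base_lr
    if epoch < 120:
        return base_lr / 10.0
    return base_lr / 100.0

# Spot-check the three plateaus over the reported 160-epoch run.
print([step_lr(e) for e in (0, 79, 80, 119, 120, 159)])
# -> [0.5, 0.5, 0.05, 0.05, 0.005, 0.005]
```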