Staleness-Aware Async-SGD for Distributed Deep Learning
Authors: Wei Zhang, Suyog Gupta, Xiangru Lian, Ji Liu
IJCAI 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental verification is performed on commonly used image classification benchmarks, CIFAR10 and ImageNet, to demonstrate the superior effectiveness of the proposed approach compared to SSGD (Synchronous SGD) and the conventional ASGD algorithm. |
| Researcher Affiliation | Collaboration | Wei Zhang, Suyog Gupta (IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA; {weiz,suyog}@us.ibm.com); Xiangru Lian, Ji Liu (Department of Computer Science, University of Rochester, NY 14627, USA; {lianxiangru,ji.liu.uwisc}@gmail.com) |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found. |
| Open Source Code | No | No explicit statement about the release of source code or a link to a code repository was found. |
| Open Datasets | Yes | We present results on two datasets: CIFAR10 [Krizhevsky and Hinton, 2009] and Image Net [Russakovsky et al., 2015]. |
| Dataset Splits | Yes | The CIFAR10 [Krizhevsky and Hinton, 2009] dataset comprises a total of 60,000 RGB images of size 32×32 pixels partitioned into the training set (50,000 images) and the test set (10,000 images). ... The training set is a subset of the ImageNet database and contains 1.2 million 256×256 pixel images. The validation dataset has 50,000 images. |
| Hardware Specification | Yes | We deploy our implementation on a P775 supercomputer. Each node of this system contains four eight-core 3.84 GHz IBM POWER7 processors, one optical connect controller chip and 128 GB of memory. A single node has a theoretical floating point peak performance of 982 Gflop/s, memory bandwidth of 512 GB/s and bi-directional interconnect bandwidth of 192 GB/s. |
| Software Dependencies | No | The paper mentions using 'MPI' and the 'open-source Caffe deep learning package' but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | When using a single learner, the mini-batch size is set to 128 and training for 140 epochs using momentum-accelerated SGD (momentum = 0.9) results in a model... The base learning rate is set to 0.001 and reduced by a factor of 10 after the 120th and 130th epoch. In order to achieve comparable model accuracy as the single learner, we follow the prescription of [Gupta et al., 2015] and reduce the mini-batch size per learner as more learners are added to the system, in order to keep the product of mini-batch size and number of learners approximately invariant. ... With a single learner, training with a mini-batch size of 256 and momentum 0.9 results in a top-1 error of 42.56% and a top-5 error of 19.18% on the validation set at the end of 35 epochs. The initial learning rate is set to 0.01 and reduced by a factor of 5 after the 20th and again after the 30th epoch. |
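
The single-learner baselines quoted in the Experiment Setup row follow simple step learning-rate schedules. The snippet below is a minimal, framework-agnostic sketch of those schedules as reported (CIFAR10: base rate 0.001, divided by 10 after epochs 120 and 130; ImageNet: base rate 0.01, divided by 5 after epochs 20 and 30). The function and variable names are illustrative assumptions, not the authors' Caffe configuration.

```python
def step_lr(epoch, base_lr, milestones, factor):
    """Learning rate after applying a step decay at each milestone epoch.

    Illustrative helper only; the paper trains with momentum SGD
    (momentum = 0.9) in Caffe, so this mirrors the reported schedule,
    not their actual code.
    """
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr /= factor
    return lr

# CIFAR10 baseline: 140 epochs, mini-batch 128,
# lr 0.001 cut by 10x after epochs 120 and 130.
cifar10_lr = [step_lr(e, base_lr=1e-3, milestones=(120, 130), factor=10)
              for e in range(140)]

# ImageNet baseline: 35 epochs, mini-batch 256,
# lr 0.01 cut by 5x after epochs 20 and 30.
imagenet_lr = [step_lr(e, base_lr=1e-2, milestones=(20, 30), factor=5)
               for e in range(35)]
```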
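
The mini-batch scaling prescription quoted in the same row (shrink the per-learner mini-batch as learners are added so that the product of mini-batch size and number of learners stays roughly constant, following Gupta et al., 2015) can be stated compactly; the helper below is a hypothetical sketch, with names chosen for illustration.

```python
def per_learner_batch_size(single_learner_batch, n_learners):
    """Per-learner mini-batch size such that
    (mini-batch size) x (number of learners) stays approximately equal
    to the single-learner mini-batch size, as prescribed in the paper.
    """
    return max(1, single_learner_batch // n_learners)

# CIFAR10 example: single-learner mini-batch of 128 split across learners.
for n in (1, 2, 4, 8, 16):
    print(n, per_learner_batch_size(128, n))  # 128, 64, 32, 16, 8
```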