Staleness-Aware Async-SGD for Distributed Deep Learning

Authors: Wei Zhang, Suyog Gupta, Xiangru Lian, Ji Liu

IJCAI 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental verification is performed on commonly-used image classification benchmarks: CIFAR10 and ImageNet to demonstrate the superior effectiveness of the proposed approach, compared to SSGD (Synchronous SGD) and the conventional ASGD algorithm.
Researcher Affiliation | Collaboration | Wei Zhang, Suyog Gupta (IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA; {weiz,suyog}@us.ibm.com); Xiangru Lian, Ji Liu (Department of Computer Science, University of Rochester, NY 14627, USA; {lianxiangru,ji.liu.uwisc}@gmail.com)
Pseudocode | No | No structured pseudocode or algorithm blocks were found.
Open Source Code | No | No explicit statement about the release of source code or a link to a code repository was found.
Open Datasets | Yes | We present results on two datasets: CIFAR10 [Krizhevsky and Hinton, 2009] and ImageNet [Russakovsky et al., 2015].
Dataset Splits | Yes | The CIFAR10 [Krizhevsky and Hinton, 2009] dataset comprises of a total of 60,000 RGB images of size 32 × 32 pixels partitioned into the training set (50,000 images) and the test set (10,000 images). ... The training set is a subset of the ImageNet database and contains 1.2 million 256 × 256 pixel images. The validation dataset has 50,000 images.
Hardware Specification | Yes | We deploy our implementation on a P775 supercomputer. Each node of this system contains four eight-core 3.84 GHz IBM POWER7 processors, one optical connect controller chip and 128 GB of memory. A single node has a theoretical floating point peak performance of 982 Gflop/s, memory bandwidth of 512 GB/s and bi-directional interconnect bandwidth of 192 GB/s.
Software Dependencies | No | The paper mentions using 'MPI' and the 'open-source Caffe deep learning package' but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | When using a single learner, the mini-batch size is set to 128 and training for 140 epochs using momentum accelerated SGD (momentum = 0.9) results in a model... The base learning rate α0 is set to 0.001 and reduced by a factor of 10 after the 120th and 130th epoch. In order to achieve comparable model accuracy as the single-learner, we follow the prescription of [Gupta et al., 2015] and reduce the minibatch size per learner as more learners are added to the system in order to keep the product of mini-batch size and number of learners approximately invariant. ... With a single learner, training with mini-batch size of 256, momentum 0.9 results in top-1 error of 42.56% and top-5 error of 19.18% on the validation set at the end of 35 epochs. The initial learning rate α0 is set to 0.01 and reduced by a factor of 5 after the 20th and again after the 30th epoch.
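
As a quick sanity check on the dataset splits quoted above, the sketch below verifies the CIFAR10 partition (50,000 training / 10,000 test images of size 32 × 32). It assumes the torchvision package purely for convenience; the original experiments used Caffe, so this is illustrative only.

```python
# Illustrative check of the CIFAR10 split sizes reported in the paper
# (50,000 train / 10,000 test, 32 x 32 RGB images).
# torchvision is an assumption made here; the authors used Caffe.
from torchvision import datasets

train_set = datasets.CIFAR10(root="./data", train=True, download=True)
test_set = datasets.CIFAR10(root="./data", train=False, download=True)

assert len(train_set) == 50_000          # training split
assert len(test_set) == 10_000           # test split
assert train_set[0][0].size == (32, 32)  # each image is 32 x 32 pixels
```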
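
The experiment setup row can also be summarized in a short, framework-agnostic sketch: the staged learning-rate decay reported for each benchmark and the per-learner mini-batch scaling prescribed by [Gupta et al., 2015] (keeping mini-batch size × number of learners roughly constant). The function names and the use of plain Python are assumptions for illustration, not the authors' Caffe configuration.

```python
# Sketch of the reported hyperparameter schedules (epochs assumed 1-indexed).

def per_learner_batch_size(total_batch: int, num_learners: int) -> int:
    """Shrink the per-learner mini-batch so batch x learners stays ~invariant."""
    return max(1, total_batch // num_learners)

def cifar10_learning_rate(epoch: int, base_lr: float = 0.001) -> float:
    """CIFAR10: base LR 0.001, divided by 10 after the 120th and 130th epoch."""
    lr = base_lr
    if epoch > 120:
        lr /= 10.0
    if epoch > 130:
        lr /= 10.0
    return lr

def imagenet_learning_rate(epoch: int, base_lr: float = 0.01) -> float:
    """ImageNet: base LR 0.01, divided by 5 after the 20th and 30th epoch."""
    lr = base_lr
    if epoch > 20:
        lr /= 5.0
    if epoch > 30:
        lr /= 5.0
    return lr

# Example (learner count chosen only for illustration): with 16 learners on
# CIFAR10, each learner would use a mini-batch of 8 (16 x 8 = 128), momentum
# 0.9, and the staged decay above over 140 epochs.
print(per_learner_batch_size(128, 16))   # -> 8
print(cifar10_learning_rate(135))        # -> 1e-05
```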