Communication-Efficient Distributed Blockwise Momentum SGD with Error-Feedback

Authors: Shuai Zheng, Ziyue Huang, James Kwok

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that the proposed method converges as fast as full-precision distributed momentum SGD and achieves the same testing accuracy. In particular, on distributed ResNet training with 7 workers on ImageNet, the proposed algorithm achieves the same testing accuracy as momentum SGD using full-precision gradients, but with 46% less wall-clock time.
Researcher Affiliation | Collaboration | Shuai Zheng 1,2, Ziyue Huang 1, James T. Kwok 1; shzheng@amazon.com, {zhuangbq, jamesk}@cse.ust.hk; 1 Department of Computer Science and Engineering, Hong Kong University of Science and Technology; 2 Amazon Web Services
Pseudocode | Yes | Algorithm 2: Distributed SGD with Error-Feedback (dist-EF-SGD); Algorithm 3: Distributed Blockwise SGD with Error-Feedback (dist-EF-blockSGD); Algorithm 4: Distributed Blockwise Momentum SGD with Error-Feedback (dist-EF-blockSGDM). See the illustrative sketch after this table.
Open Source Code | No | The paper mentions using 'publicly available code in [4]' for comparisons, but does not provide its own source code for the proposed method.
Open Datasets | Yes | Experiments are performed on the CIFAR-100 dataset, with 50K training images and 10K test images. [...] In this section, we perform distributed optimization on ImageNet [15] using a 50-layer ResNet.
Dataset Splits | No | The paper mentions '50K training images and 10K test images' for CIFAR-100 but does not specify a separate validation split.
Hardware Specification | Yes | For faster experimentation, we use a single node with multiple GPUs (an AWS P3.16 instance with 8 Nvidia V100 GPUs, each GPU being a worker) instead of a distributed setting. [...] Each worker is an AWS P3.2 instance with 1 GPU, and the parameter server is housed in one node.
Software Dependencies | No | The paper mentions 'MXNet', 'PyTorch', and the 'Gloo communication library' but does not specify version numbers for any of these software dependencies.
Experiment Setup | Yes | We vary the mini-batch size per worker in {8, 16, 32}. [...] At epoch 100, the learning rate is reduced...
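To make the structure summarized in the Pseudocode row concrete, below is a minimal NumPy sketch of distributed blockwise momentum SGD with error-feedback, simulating all workers in a single process. It is a simplified illustration rather than the paper's Algorithm 4 verbatim or the authors' code: the blockwise scaled-sign compressor, the placement of heavy-ball momentum, the block layout, the step size, and the toy objective are all assumptions made for readability.

```python
"""
Illustrative sketch (NOT the authors' released code) of distributed blockwise
momentum SGD with error-feedback. Workers are simulated in one process with
NumPy; all constants (learning rate, momentum, blocks, toy objective) are
hypothetical choices for this example.
"""
import numpy as np


def blockwise_scaled_sign(x, blocks):
    """Compress each block b of x to (||x_b||_1 / len(b)) * sign(x_b)."""
    out = np.empty_like(x)
    for start, end in blocks:
        xb = x[start:end]
        scale = np.abs(xb).sum() / max(end - start, 1)
        out[start:end] = scale * np.sign(xb)
    return out


def dist_ef_block_sgdm(grad_fn, x0, blocks, n_workers=4, lr=0.05,
                       momentum=0.9, n_steps=200, seed=0):
    rng = np.random.default_rng(seed)
    x = x0.copy()
    m = [np.zeros_like(x) for _ in range(n_workers)]   # per-worker momentum
    e = [np.zeros_like(x) for _ in range(n_workers)]   # per-worker error memory
    e_server = np.zeros_like(x)                        # server-side error memory

    for _ in range(n_steps):
        deltas = []
        for i in range(n_workers):
            g = grad_fn(x, rng)                        # local stochastic gradient
            m[i] = momentum * m[i] + g                 # heavy-ball momentum
            p = lr * m[i] + e[i]                       # error-corrected update
            delta = blockwise_scaled_sign(p, blocks)   # compress before "sending"
            e[i] = p - delta                           # remember what was lost
            deltas.append(delta)

        # Parameter server: average, correct with server error, compress again.
        p_server = np.mean(deltas, axis=0) + e_server
        delta_server = blockwise_scaled_sign(p_server, blocks)
        e_server = p_server - delta_server

        # All workers apply the same broadcast update.
        x -= delta_server
    return x


if __name__ == "__main__":
    # Toy problem: noisy gradients of f(x) = 0.5 * ||x||^2.
    dim = 10
    blocks = [(0, 5), (5, 10)]                         # hypothetical block layout
    noisy_grad = lambda x, rng: x + 0.1 * rng.standard_normal(x.shape)
    x_star = dist_ef_block_sgdm(noisy_grad, np.ones(dim), blocks)
    print("final ||x|| =", np.linalg.norm(x_star))
```

The two error memories reflect the two-sided scheme described in the paper: compression is applied both to worker-to-server messages and to the server-to-worker broadcast, and the residual of each compression is fed back into the next step so that, in the sketch above, no gradient information is permanently discarded.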