Communication-Efficient Distributed Blockwise Momentum SGD with Error-Feedback

Authors: Shuai Zheng, Ziyue Huang, James Kwok

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that the proposed method converges as fast as full-precision distributed momentum SGD and achieves the same testing accuracy. In particular, on distributed ResNet training with 7 workers on ImageNet, the proposed algorithm achieves the same testing accuracy as momentum SGD using full-precision gradients, but with 46% less wall-clock time.
Researcher Affiliation | Collaboration | Shuai Zheng 1,2, Ziyue Huang 1, James T. Kwok 1; shzheng@amazon.com, {zhuangbq, jamesk}@cse.ust.hk; 1 Department of Computer Science and Engineering, Hong Kong University of Science and Technology; 2 Amazon Web Services
Pseudocode | Yes | Algorithm 2: Distributed SGD with Error-Feedback (dist-EF-SGD); Algorithm 3: Distributed Blockwise SGD with Error-Feedback (dist-EF-blockSGD); Algorithm 4: Distributed Blockwise Momentum SGD with Error-Feedback (dist-EF-blockSGDM). See the illustrative sketch after this table.
Open Source Code | No | The paper mentions using 'publicly available code in [4]' for comparisons, but does not provide its own source code for the proposed method.
Open Datasets | Yes | Experiments are performed on the CIFAR-100 dataset, with 50K training images and 10K test images. [...] In this section, we perform distributed optimization on ImageNet [15] using a 50-layer ResNet.
Dataset Splits | No | The paper mentions '50K training images and 10K test images' for CIFAR-100 but does not specify a separate validation split.
Hardware Specification | Yes | For faster experimentation, we use a single node with multiple GPUs (an AWS P3.16 instance with 8 Nvidia V100 GPUs, each GPU being a worker) instead of a distributed setting. [...] Each worker is an AWS P3.2 instance with 1 GPU, and the parameter server is housed in one node.
Software Dependencies | No | The paper mentions 'MXNet', 'PyTorch', and the 'Gloo communication library' but does not specify version numbers for any of these software dependencies.
Experiment Setup | Yes | We vary the mini-batch size per worker in {8, 16, 32}. [...] At epoch 100, the learning rate is reduced...
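To make the structure summarized in the Pseudocode row concrete, below is a minimal NumPy sketch of distributed blockwise momentum SGD with error-feedback, simulating all workers in a single process. It is a simplified illustration rather than the paper's Algorithm 4 verbatim or the authors' code: the blockwise scaled-sign compressor, the placement of heavy-ball momentum, the block layout, the step size, and the toy objective are all assumptions made for readability.

```python
"""
Illustrative sketch (NOT the authors' released code) of distributed blockwise
momentum SGD with error-feedback. Workers are simulated in one process with
NumPy; all constants (learning rate, momentum, blocks, toy objective) are
hypothetical choices for this example.
"""
import numpy as np


def blockwise_scaled_sign(x, blocks):
    """Compress each block b of x to (||x_b||_1 / len(b)) * sign(x_b)."""
    out = np.empty_like(x)
    for start, end in blocks:
        xb = x[start:end]
        scale = np.abs(xb).sum() / max(end - start, 1)
        out[start:end] = scale * np.sign(xb)
    return out


def dist_ef_block_sgdm(grad_fn, x0, blocks, n_workers=4, lr=0.05,
                       momentum=0.9, n_steps=200, seed=0):
    rng = np.random.default_rng(seed)
    x = x0.copy()
    m = [np.zeros_like(x) for _ in range(n_workers)]   # per-worker momentum
    e = [np.zeros_like(x) for _ in range(n_workers)]   # per-worker error memory
    e_server = np.zeros_like(x)                        # server-side error memory

    for _ in range(n_steps):
        deltas = []
        for i in range(n_workers):
            g = grad_fn(x, rng)                        # local stochastic gradient
            m[i] = momentum * m[i] + g                 # heavy-ball momentum
            p = lr * m[i] + e[i]                       # error-corrected update
            delta = blockwise_scaled_sign(p, blocks)   # compress before "sending"
            e[i] = p - delta                           # remember what was lost
            deltas.append(delta)

        # Parameter server: average, correct with server error, compress again.
        p_server = np.mean(deltas, axis=0) + e_server
        delta_server = blockwise_scaled_sign(p_server, blocks)
        e_server = p_server - delta_server

        # All workers apply the same broadcast update.
        x -= delta_server
    return x


if __name__ == "__main__":
    # Toy problem: noisy gradients of f(x) = 0.5 * ||x||^2.
    dim = 10
    blocks = [(0, 5), (5, 10)]                         # hypothetical block layout
    noisy_grad = lambda x, rng: x + 0.1 * rng.standard_normal(x.shape)
    x_star = dist_ef_block_sgdm(noisy_grad, np.ones(dim), blocks)
    print("final ||x|| =", np.linalg.norm(x_star))
```

The two error memories reflect the two-sided scheme described in the paper: compression is applied both to worker-to-server messages and to the server-to-worker broadcast, and the residual of each compression is fed back into the next step so that, in the sketch above, no gradient information is permanently discarded.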