A Convergence Analysis of Distributed SGD with Communication-Efficient Gradient Sparsification

Authors: Shaohuai Shi, Kaiyong Zhao, Qiang Wang, Zhenheng Tang, Xiaowen Chu

IJCAI 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Finally, we conduct extensive experiments on different machine learning models and data sets to verify the soundness of the assumptions and theoretical results, and discuss the impact of the compression ratio on the convergence performance." "We conduct extensive experiments on representative deep learning models and data sets to verify the soundness of the assumptions and theoretical results."
Researcher Affiliation | Academia | Shaohuai Shi, Kaiyong Zhao, Qiang Wang, Zhenheng Tang and Xiaowen Chu, Department of Computer Science, Hong Kong Baptist University, {csshshi, kyzhao, qiangwang, zhtang, chxw}@comp.hkbu.edu.hk
Pseudocode | Yes | "Algorithm 1: gTop-k S-SGD at worker p" (a hedged sketch of the sparsification step appears after this table)
Open Source Code | No | The paper does not provide any statement or link indicating that the source code for their methodology is publicly available.
Open Datasets | Yes | "1) Image classification: Two popular DNNs, VGG-16 [Simonyan and Zisserman, 2014] and ResNet-20 [He et al., 2016], are used for evaluation on the data set of Cifar-10 (https://www.cs.toronto.edu/~kriz/cifar.html), which consists of 50000 training images. 2) Language model: A 2-layer LSTM model (LSTM-PTB) with 1500 hidden units per layer is adopted for evaluation on the data set of PTB [Marcus et al., 1993], which contains 923000 training words. 3) Speech recognition: A 5-layer LSTM model (LSTM-AN4) with 800 hidden units per layer is used for evaluation on AN4 [Acero, 1990], which contains 948 training utterances."
Dataset Splits | No | The paper mentions '50000 training images' for Cifar-10 and mini-batch sizes but does not specify how the datasets are split into training, validation, and test sets with percentages or counts for each.
Hardware Specification | No | The paper mentions 'GPU clusters' but does not provide specific details on the hardware, such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper does not provide specific software dependencies or their version numbers (e.g., programming languages, libraries, frameworks).
Experiment Setup | Yes | "In all training models, we exploit the warmup strategy in gTop-k S-SGD on the 4-worker distributed environment. The main hyper-parameters adopted in evaluation are shown in Table 1." Table 1 (Hyper-parameters for different DNNs): VGG-16 (B=512, initial α=0.1, 140 epochs); ResNet-20 (B=128, initial α=0.1, 140 epochs); LSTM-PTB (B=400, initial α=30, 40 epochs); LSTM-AN4 (B=32, initial α=0.0002, 80 epochs). (A hedged configuration sketch also follows the table.)
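
The excerpt only names Algorithm 1 (gTop-k S-SGD at worker p), so the following is a minimal single-process sketch of top-k gradient sparsification with residual (error-feedback) accumulation, assuming NumPy. The helper names topk_sparsify and gtopk_step are illustrative, not taken from the paper, and the real gTop-k protocol performs a tree-structured global top-k reduction across workers that is only approximated here by a second top-k over the summed sparse gradients.

    import numpy as np

    def topk_sparsify(grad, k):
        # Keep the k largest-magnitude entries of grad, zero the rest.
        flat = grad.ravel()
        idx = np.argpartition(np.abs(flat), -k)[-k:]   # indices of the top-k magnitudes
        sparse = np.zeros_like(flat)
        sparse[idx] = flat[idx]
        return sparse.reshape(grad.shape)

    def gtopk_step(local_grads, residuals, k):
        # One sparsified aggregation step over a list of per-worker gradients.
        sparse_grads = []
        for p, g in enumerate(local_grads):
            acc = g + residuals[p]            # error feedback: add the accumulated residual
            sparse = topk_sparsify(acc, k)    # local top-k selection
            residuals[p] = acc - sparse       # values not communicated stay in the residual
            sparse_grads.append(sparse)
        # Global selection: sum the sparse contributions and keep the global top-k,
        # a stand-in for the tree-based gTop-k reduction across P workers.
        return topk_sparsify(np.sum(sparse_grads, axis=0), k)

    # Toy usage: 4 workers, a 1000-dimensional gradient, 1% density.
    P, d, k = 4, 1000, 10
    residuals = [np.zeros(d) for _ in range(P)]
    grads = [np.random.randn(d) for _ in range(P)]
    agg = gtopk_step(grads, residuals, k)
    print("non-zeros in aggregated gradient:", np.count_nonzero(agg))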
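
The Table 1 hyper-parameters and the warmup strategy mentioned in the setup could be collected into a small configuration sketch like the one below; the linear five-epoch warmup is an assumption for illustration, as the excerpt does not state the authors' actual schedule.

    # Hyper-parameters from Table 1: mini-batch size B, initial learning rate alpha, epochs.
    HYPERPARAMS = {
        "VGG-16":    {"B": 512, "alpha": 0.1,    "epochs": 140},
        "ResNet-20": {"B": 128, "alpha": 0.1,    "epochs": 140},
        "LSTM-PTB":  {"B": 400, "alpha": 30,     "epochs": 40},
        "LSTM-AN4":  {"B": 32,  "alpha": 0.0002, "epochs": 80},
    }

    def warmup_lr(alpha, epoch, warmup_epochs=5):
        # Linear ramp over the first warmup_epochs (assumed schedule, not from the paper).
        if epoch < warmup_epochs:
            return alpha * (epoch + 1) / warmup_epochs
        return alpha

    cfg = HYPERPARAMS["ResNet-20"]
    print([round(warmup_lr(cfg["alpha"], e), 3) for e in range(6)])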