Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training

Authors: Yujun Lin, Song Han, Huizi Mao, Yu Wang, Bill Dally

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We have applied Deep Gradient Compression to image classification, speech recognition, and language modeling with multiple datasets including Cifar10, ImageNet, Penn Treebank, and Librispeech Corpus. On these scenarios, Deep Gradient Compression achieves a gradient compression ratio from 270× to 600× without losing accuracy, cutting the gradient size of ResNet-50 from 97MB to 0.35MB, and for DeepSpeech from 488MB to 0.74MB."
Researcher Affiliation | Collaboration | Yujun Lin (Tsinghua University / Beijing National Research Center for Information Science and Technology), linyy14@mails.tsinghua.edu.cn; Song Han (Stanford University / Google Brain), songhan@stanford.edu; Huizi Mao (Stanford University), huizi@stanford.edu; Yu Wang (Tsinghua University / Beijing National Research Center for Information Science and Technology), yu-wang@mail.tsinghua.edu.cn; William J. Dally (Stanford University / NVIDIA), dally@stanford.edu
Pseudocode | Yes | Algorithm 1: Gradient Sparsification on node k; Algorithm 2: Distributed Synchronous SGD on node k; Algorithm 3: Deep Gradient Compression for vanilla momentum SGD on node k; Algorithm 4: Deep Gradient Compression for Nesterov momentum SGD on node k. (A hedged sketch of the sparsification step appears after the table.)
Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | "We have applied Deep Gradient Compression to image classification, speech recognition, and language modeling with multiple datasets including Cifar10, ImageNet, Penn Treebank, and Librispeech Corpus."
Dataset Splits | Yes | "Cifar10 consists of 50,000 training images and 10,000 validation images in 10 classes (Krizhevsky & Hinton, 2009), while ImageNet contains over 1 million training images and 50,000 validation images in 1000 classes (Deng et al., 2009). The Penn Treebank corpus (PTB) dataset consists of 923,000 training, 73,000 validation and 82,000 test words (Marcus et al., 1993)."
Hardware Specification | Yes | "Each training node has 4 NVIDIA Titan XP GPUs and one PCI switch."
Software Dependencies | No | The paper mentions general software like MXNet but does not provide specific version numbers for any software dependencies, which are required for reproducibility.
Experiment Setup | Yes | "In all experiments related to DGC, we raise the sparsity in the warm-up period as follows: 75%, 93.75%, 98.4375%, 99.6%, 99.9% (increasing exponentially to 99.9%). The warm-up period for DGC is 4 epochs out of 164 epochs for Cifar10 and 4 epochs out of 90 epochs for the ImageNet dataset. The warm-up period is 1 epoch out of 40 epochs. We train the models with momentum SGD following the training schedule in Gross & Wilber (2016). We use the DeepSpeech architecture... with Nesterov momentum SGD and gradient clipping, while learning rate anneals every epoch." (A sketch of this warm-up schedule also appears after the table.)
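
For illustration, here is a minimal Python sketch of the local-accumulation and top-k selection idea behind the paper's Algorithm 1 (Gradient Sparsification). The function name, NumPy implementation, and default sparsity are assumptions rather than the authors' code; momentum correction, local gradient clipping, and the actual all-reduce between nodes are omitted.

```python
# Minimal sketch (assumed names, NumPy only) of gradient sparsification with
# local accumulation, in the spirit of the paper's Algorithm 1.
import numpy as np

def sparsify(grad, residual, sparsity=0.999):
    """Keep only the largest-magnitude entries of (residual + grad) for
    communication; everything else stays in the local residual."""
    acc = residual + grad                                  # local gradient accumulation
    k = max(1, int(round(acc.size * (1.0 - sparsity))))    # e.g. top 0.1% at 99.9% sparsity
    threshold = np.partition(np.abs(acc).ravel(), -k)[-k]  # k-th largest magnitude
    mask = np.abs(acc) >= threshold                        # entries to communicate
    sparse_grad = np.where(mask, acc, 0.0)                 # sent across nodes (values + indices)
    new_residual = np.where(mask, 0.0, acc)                # kept on this node for later steps
    return sparse_grad, new_residual
```

Sending roughly 0.1% of the gradient values (plus their indices) is what yields compression ratios on the order of the reported 270× to 600×; for example, 97 MB / 0.35 MB ≈ 277× for ResNet-50.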
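The quoted warm-up schedule can also be written out as a small sketch: the kept fraction of gradients shrinks by 4× each warm-up epoch, after which sparsity is held at 99.9%. The helper name and its arguments are illustrative assumptions.

```python
# Minimal sketch (assumed helper name/signature) of the quoted sparsity warm-up:
# 75%, 93.75%, 98.4375%, ~99.6% over the warm-up epochs, then 99.9% afterwards.
def warmup_sparsity(epoch, warmup_epochs=4, final_sparsity=0.999):
    if epoch >= warmup_epochs:
        return final_sparsity
    keep = 0.25 * (0.25 ** epoch)   # 25%, 6.25%, 1.5625%, 0.390625% of gradients kept
    return 1.0 - keep

# epochs 0..4 -> 0.75, 0.9375, 0.984375, 0.99609375, 0.999
print([warmup_sparsity(e) for e in range(5)])
```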