Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training
Authors: Yujun Lin, Song Han, Huizi Mao, Yu Wang, Bill Dally
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We have applied Deep Gradient Compression to image classification, speech recognition, and language modeling with multiple datasets including Cifar10, ImageNet, Penn Treebank, and Librispeech Corpus. On these scenarios, Deep Gradient Compression achieves a gradient compression ratio from 270× to 600× without losing accuracy, cutting the gradient size of ResNet-50 from 97MB to 0.35MB, and for DeepSpeech from 488MB to 0.74MB. |
| Researcher Affiliation | Collaboration | Yujun Lin, Tsinghua University / Beijing National Research Center for Information Science and Technology, linyy14@mails.tsinghua.edu.cn; Song Han, Stanford University / Google Brain, songhan@stanford.edu; Huizi Mao, Stanford University, huizi@stanford.edu; Yu Wang, Tsinghua University / Beijing National Research Center for Information Science and Technology, yu-wang@mail.tsinghua.edu.cn; William J. Dally, Stanford University / NVIDIA, dally@stanford.edu |
| Pseudocode | Yes | Algorithm 1 Gradient Sparsification on node k; Algorithm 2 Distributed Synchronous SGD on node k; Algorithm 3 Deep Gradient Compression for vanilla momentum SGD on node k; Algorithm 4 Deep Gradient Compression for Nesterov momentum SGD on node k |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | We have applied Deep Gradient Compression to image classification, speech recognition, and language modeling with multiple datasets including Cifar10, ImageNet, Penn Treebank, and Librispeech Corpus. |
| Dataset Splits | Yes | Cifar10 consists of 50,000 training images and 10,000 validation images in 10 classes (Krizhevsky & Hinton, 2009), while ImageNet contains over 1 million training images and 50,000 validation images in 1000 classes (Deng et al., 2009). The Penn Treebank corpus (PTB) dataset consists of 923,000 training, 73,000 validation and 82,000 test words (Marcus et al., 1993). |
| Hardware Specification | Yes | Each training node has 4 NVIDIA Titan XP GPUs and one PCI switch. |
| Software Dependencies | No | The paper mentions general software like MXNet but does not provide specific version numbers for any software dependencies, which are required for reproducibility. |
| Experiment Setup | Yes | In all experiments related to DGC, we raise the sparsity in the warm-up period as follows: 75%, 93.75%, 98.4375%, 99.6%, 99.9% (exponentially increasing to 99.9%). The warm-up period for DGC is 4 epochs out of 164 epochs for Cifar10 and 4 epochs out of 90 epochs for the ImageNet dataset. The warm-up period is 1 epoch out of 40 epochs. We train the models with momentum SGD following the training schedule in Gross & Wilber (2016). We use the DeepSpeech architecture... with Nesterov momentum SGD and gradient clipping, while the learning rate anneals every epoch. |
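
The Research Type row quotes a 270× to 600× compression ratio along with per-model gradient sizes. As a quick arithmetic check (purely illustrative; the byte counts are the rounded megabyte figures from the quote, so the exact ratios reported in the paper may differ slightly):

```python
# Compression ratios implied by the rounded gradient sizes quoted above.
resnet50_full_mb, resnet50_dgc_mb = 97.0, 0.35       # ResNet-50: before / after DGC
deepspeech_full_mb, deepspeech_dgc_mb = 488.0, 0.74  # DeepSpeech: before / after DGC

print(f"ResNet-50:  ~{resnet50_full_mb / resnet50_dgc_mb:.0f}x")     # ~277x
print(f"DeepSpeech: ~{deepspeech_full_mb / deepspeech_dgc_mb:.0f}x")  # ~659x
```

The first matches the low end of the headline range; the second comes out somewhat above 600×, which is plausibly an artifact of rounding the 0.74MB figure.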
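The Pseudocode row lists Algorithms 1-4, which combine local gradient accumulation, momentum correction, and top-k sparsification. Since no source code is released (see the Open Source Code row), the snippet below is only a minimal NumPy sketch of that idea for vanilla momentum SGD; the function name `dgc_step`, the fixed `sparsity` argument, and the threshold selection via `np.partition` are illustrative choices, not the authors' implementation (warm-up sparsity, gradient clipping, and the all-reduce of the sparse updates are omitted).

```python
import numpy as np

def dgc_step(grad, velocity, residual, momentum=0.9, sparsity=0.999):
    """One DGC-style step on a flattened gradient (illustrative sketch).

    velocity : locally accumulated momentum buffer
    residual : locally accumulated, not-yet-communicated update
    Returns the sparse update to communicate plus the updated local buffers.
    """
    # Momentum correction: apply momentum locally before accumulation.
    velocity = momentum * velocity + grad
    residual = residual + velocity

    # Top-k sparsification: keep only the largest-magnitude entries.
    k = max(1, int(residual.size * (1.0 - sparsity)))
    threshold = np.partition(np.abs(residual), -k)[-k]
    mask = np.abs(residual) >= threshold

    sparse_update = np.where(mask, residual, 0.0)

    # Entries that were sent are cleared from both local buffers;
    # the rest stay and keep accumulating until they grow large enough.
    residual = np.where(mask, 0.0, residual)
    velocity = np.where(mask, 0.0, velocity)
    return sparse_update, velocity, residual
```

In a distributed run, each worker would encode the non-zero entries of `sparse_update` as index/value pairs, all-reduce them, and apply the summed sparse update to the model.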
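The Experiment Setup row's warm-up schedule (75%, 93.75%, 98.4375%, 99.6%, then 99.9%) corresponds to keeping 25% of the gradients in the first warm-up epoch and shrinking the kept fraction by roughly 4× per epoch until the final 99.9% sparsity. A small sketch of such a schedule (the function and its defaults are assumptions for illustration):

```python
def warmup_sparsity(epoch, warmup_epochs=4, final_sparsity=0.999, initial_density=0.25):
    """Sparsity used at a given epoch during exponential warm-up (sketch).

    Keeps 25% of gradients in epoch 0 and shrinks the kept fraction by 4x
    per epoch until the final sparsity (default 99.9%) takes over.
    """
    if epoch >= warmup_epochs:
        return final_sparsity
    density = initial_density * (0.25 ** epoch)  # 25%, 6.25%, 1.5625%, 0.39%, ...
    return min(final_sparsity, 1.0 - density)

# Reproduces the quoted schedule: 75%, 93.75%, 98.4375%, ~99.6%, then 99.9%.
print([round(warmup_sparsity(e), 6) for e in range(6)])
```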