QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding
Authors: Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, Milan Vojnovic
NeurIPS 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | When applied to training deep neural networks for image classification and automated speech recognition, QSGD leads to significant reductions in end-to-end training time. For instance, on 16 GPUs, we can train the ResNet-152 network to full accuracy on ImageNet 1.8× faster than the full-precision variant. ... Experiments. The crucial question is whether, in practice, QSGD can reduce communication cost by enough to offset the overhead of any additional iterations to convergence. The answer is yes. |
| Researcher Affiliation | Collaboration | Dan Alistarh, IST Austria & ETH Zurich, dan.alistarh@ist.ac.at; Demjan Grubic, ETH Zurich & Google, demjangrubic@gmail.com; Jerry Z. Li, MIT, jerryzli@mit.edu; Ryota Tomioka, Microsoft Research, ryoto@microsoft.com; Milan Vojnovic, London School of Economics, M.Vojnovic@lse.ac.uk |
| Pseudocode | Yes | Algorithm 1: Parallel SGD Algorithm. |
| Open Source Code | Yes | Our code is released as open-source [31]. |
| Open Datasets | Yes | We execute two types of tasks: image classification on ILSVRC 2015 (ImageNet) [12], CIFAR-10 [25], and MNIST [27], and speech recognition on the CMU AN4 dataset [2]. |
| Dataset Splits | No | The paper uses well-known benchmark datasets but does not explicitly state the training, validation, or test splits (percentages, exact counts, or specific files), relying instead on the standard splits implied by those benchmarks. |
| Hardware Specification | Yes | We performed experiments on Amazon EC2 p2.16xlarge instances, with 16 NVIDIA K80 GPUs. |
| Software Dependencies | No | The paper mentions implementing QSGD using the 'Microsoft Cognitive Toolkit (CNTK) [3]' but does not provide a specific version number for CNTK or any other software dependencies, which would be necessary for reproducibility. |
| Experiment Setup | Yes | We used standard sizes for the networks, with hyperparameters optimized for the 32-bit precision variant. (Unless otherwise stated, we use the default networks and hyper-parameters optimized for full-precision CNTK 2.0.) We increased batch size when necessary to balance communication and computation for larger GPU counts, but never past the point where we lose accuracy. |
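
For context on the "Pseudocode" and "Open Source Code" rows above: the core primitive the paper evaluates is unbiased stochastic quantization of each gradient vector to a small number of levels before communication. The snippet below is a minimal NumPy sketch of that scheme under stated assumptions; it is not the authors' released implementation, the function names are illustrative, and it omits the lossless Elias-style integer encoding the paper pairs with quantization.

```python
import numpy as np

def qsgd_quantize(v, s, rng=None):
    """Stochastically quantize gradient vector v to s levels per coordinate.

    Returns (norm, signs, levels). The decoded value norm * signs * levels / s
    is an unbiased estimate of v, so it can be averaged across workers.
    """
    rng = np.random.default_rng() if rng is None else rng
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return 0.0, np.zeros(v.shape, dtype=np.int8), np.zeros(v.shape, dtype=np.int32)
    ratio = np.abs(v) / norm * s            # position of each coordinate in [0, s]
    lower = np.floor(ratio)                 # integer level just below
    prob = ratio - lower                    # probability of rounding up
    levels = (lower + (rng.random(v.shape) < prob)).astype(np.int32)
    signs = np.sign(v).astype(np.int8)
    return norm, signs, levels

def qsgd_dequantize(norm, signs, levels, s):
    """Decode a quantized gradient; in expectation this recovers v exactly."""
    return norm * signs * levels / s
```

A usage sketch: each worker sends `qsgd_quantize(g, s=4)` instead of the dense float gradient `g`, and the receiver averages `qsgd_dequantize(norm, signs, levels, s=4)` across workers; smaller `s` means fewer bits per coordinate at the cost of higher quantization variance.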