QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding

Authors: Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, Milan Vojnovic

NeurIPS 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | When applied to training deep neural networks for image classification and automated speech recognition, QSGD leads to significant reductions in end-to-end training time. For instance, on 16 GPUs, we can train the ResNet-152 network to full accuracy on ImageNet 1.8× faster than the full-precision variant. ... Experiments. The crucial question is whether, in practice, QSGD can reduce communication cost by enough to offset the overhead of any additional iterations to convergence. The answer is yes.
Researcher Affiliation | Collaboration | Dan Alistarh (IST Austria & ETH Zurich, dan.alistarh@ist.ac.at); Demjan Grubic (ETH Zurich & Google, demjangrubic@gmail.com); Jerry Z. Li (MIT, jerryzli@mit.edu); Ryota Tomioka (Microsoft Research, ryoto@microsoft.com); Milan Vojnovic (London School of Economics, M.Vojnovic@lse.ac.uk)
Pseudocode | Yes | Algorithm 1: Parallel SGD Algorithm.
Open Source Code | Yes | Our code is released as open-source [31].
Open Datasets | Yes | We execute two types of tasks: image classification on ILSVRC 2015 (ImageNet) [12], CIFAR-10 [25], and MNIST [27], and speech recognition on the CMU AN4 dataset [2].
Dataset Splits | No | The paper uses well-known benchmark datasets but does not explicitly state the training, validation, or test splits (e.g., percentages, exact counts, or specific files), beyond implying the standard splits for these benchmarks.
Hardware Specification | Yes | We performed experiments on Amazon EC2 p2.16xlarge instances, with 16 NVIDIA K80 GPUs.
Software Dependencies | No | The paper mentions implementing QSGD using the Microsoft Cognitive Toolkit (CNTK) [3] but does not provide version numbers for CNTK or any other software dependencies, which would be necessary for reproducibility.
Experiment Setup | Yes | We used standard sizes for the networks, with hyperparameters optimized for the 32-bit precision variant. (Unless otherwise stated, we use the default networks and hyperparameters optimized for full-precision CNTK 2.0.) We increased batch size when necessary to balance communication and computation for larger GPU counts, but never past the point where we lose accuracy.
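
For orientation, below is a minimal NumPy sketch of the s-level stochastic quantizer that the Algorithm 1 pseudocode applies to gradients before communication: coordinates are scaled by the gradient's ℓ2 norm and randomly rounded to one of s levels so that the decoded gradient is unbiased. The function names and the (norm, sign, level) representation are illustrative assumptions, not the interface of the authors' released code [31], and the paper's additional lossless encoding of the integer levels is omitted.

    import numpy as np

    def qsgd_quantize(v, s, rng):
        # Quantize gradient vector v to s levels per coordinate; the decoding
        # norm * sign * (level / s) is an unbiased estimate of v.
        norm = np.linalg.norm(v)
        if norm == 0.0:
            return 0.0, np.zeros(v.shape, np.int8), np.zeros(v.shape, np.int32)
        scaled = np.abs(v) * s / norm    # each entry lies in [0, s]
        lower = np.floor(scaled)         # integer level l with scaled in [l, l+1]
        # Round up to l + 1 with probability (scaled - l), so E[level] = scaled.
        level = lower + (rng.random(v.shape) < (scaled - lower))
        return norm, np.sign(v).astype(np.int8), level.astype(np.int32)

    def qsgd_dequantize(norm, sign, level, s):
        # Reconstruct the unbiased gradient estimate.
        return norm * sign * (level / s)

    # Usage: with s = 4, the decoded gradient matches the original in expectation,
    # which is what lets SGD keep its convergence behavior under quantization.
    rng = np.random.default_rng(0)
    g = rng.standard_normal(8).astype(np.float32)
    print(qsgd_dequantize(*qsgd_quantize(g, 4, rng), 4))

Smaller s sends fewer bits per coordinate at the cost of noisier updates, which is exactly the communication-versus-convergence trade-off the experiments summarized above evaluate.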