QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding

Authors: Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, Milan Vojnovic

NeurIPS 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | When applied to training deep neural networks for image classification and automated speech recognition, QSGD leads to significant reductions in end-to-end training time. For instance, on 16 GPUs, we can train the ResNet-152 network to full accuracy on ImageNet 1.8× faster than the full-precision variant. ... Experiments. The crucial question is whether, in practice, QSGD can reduce communication cost by enough to offset the overhead of any additional iterations to convergence. The answer is yes.
Researcher Affiliation | Collaboration | Dan Alistarh (IST Austria & ETH Zurich, dan.alistarh@ist.ac.at); Demjan Grubic (ETH Zurich & Google, demjangrubic@gmail.com); Jerry Z. Li (MIT, jerryzli@mit.edu); Ryota Tomioka (Microsoft Research, ryoto@microsoft.com); Milan Vojnovic (London School of Economics, M.Vojnovic@lse.ac.uk)
Pseudocode | Yes | Algorithm 1: Parallel SGD Algorithm.
Open Source Code | Yes | Our code is released as open-source [31].
Open Datasets | Yes | We execute two types of tasks: image classification on ILSVRC 2015 (ImageNet) [12], CIFAR-10 [25], and MNIST [27], and speech recognition on the CMU AN4 dataset [2].
Dataset Splits | No | The paper uses well-known benchmark datasets but does not explicitly state the training, validation, or test splits (e.g., percentages, exact counts, or specific files), beyond implying the standard splits for these benchmarks.
Hardware Specification | Yes | We performed experiments on Amazon EC2 p2.16xlarge instances, with 16 NVIDIA K80 GPUs.
Software Dependencies | No | The paper mentions implementing QSGD using the Microsoft Cognitive Toolkit (CNTK) [3] but does not provide version numbers for CNTK or any other software dependencies, which would be necessary for reproducibility.
Experiment Setup | Yes | We used standard sizes for the networks, with hyperparameters optimized for the 32-bit precision variant. (Unless otherwise stated, we use the default networks and hyperparameters optimized for full-precision CNTK 2.0.) We increased batch size when necessary to balance communication and computation for larger GPU counts, but never past the point where we lose accuracy.
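
For orientation, below is a minimal NumPy sketch of the s-level stochastic quantizer that the Algorithm 1 pseudocode applies to gradients before communication: coordinates are scaled by the gradient's ℓ2 norm and randomly rounded to one of s levels so that the decoded gradient is unbiased. The function names and the (norm, sign, level) representation are illustrative assumptions, not the interface of the authors' released code [31], and the paper's additional lossless encoding of the integer levels is omitted.

    import numpy as np

    def qsgd_quantize(v, s, rng):
        # Quantize gradient vector v to s levels per coordinate; the decoding
        # norm * sign * (level / s) is an unbiased estimate of v.
        norm = np.linalg.norm(v)
        if norm == 0.0:
            return 0.0, np.zeros(v.shape, np.int8), np.zeros(v.shape, np.int32)
        scaled = np.abs(v) * s / norm    # each entry lies in [0, s]
        lower = np.floor(scaled)         # integer level l with scaled in [l, l+1]
        # Round up to l + 1 with probability (scaled - l), so E[level] = scaled.
        level = lower + (rng.random(v.shape) < (scaled - lower))
        return norm, np.sign(v).astype(np.int8), level.astype(np.int32)

    def qsgd_dequantize(norm, sign, level, s):
        # Reconstruct the unbiased gradient estimate.
        return norm * sign * (level / s)

    # Usage: with s = 4, the decoded gradient matches the original in expectation,
    # which is what lets SGD keep its convergence behavior under quantization.
    rng = np.random.default_rng(0)
    g = rng.standard_normal(8).astype(np.float32)
    print(qsgd_dequantize(*qsgd_quantize(g, 4, rng), 4))

Smaller s sends fewer bits per coordinate at the cost of noisier updates, which is exactly the communication-versus-convergence trade-off the experiments summarized above evaluate.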