TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning

Authors: Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, Hai Li

NeurIPS 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that applying TernGrad on AlexNet doesn't incur any accuracy loss and can even improve accuracy. The accuracy loss of GoogLeNet induced by TernGrad is less than 2% on average. Finally, a performance model is proposed to study the scalability of TernGrad. Experiments show significant speed gains for various deep neural networks.
Researcher Affiliation | Collaboration | ¹Duke University, ²Hewlett Packard Labs, ³University of Nevada, Reno, ⁴University of Pittsburgh
Pseudocode | Yes | Algorithm 1 TernGrad: distributed SGD training using ternary gradients. (A sketch of the ternarization step appears after this table.)
Open Source Code | Yes | Our source code is available: https://github.com/wenwei202/terngrad
Open Datasets | Yes | We study the convergence of TernGrad using LeNet on MNIST and a ConvNet [35] (named as CifarNet) on CIFAR-10. ... We also evaluate TernGrad by AlexNet and GoogLeNet trained on ImageNet.
Dataset Splits | No | The paper mentions "Validation accuracy is evaluated using only the central crops of images" but does not provide specific training/validation/test split percentages or sample counts.
Hardware Specification | Yes | Figure 5: Training throughput on two different GPU clusters: (a) a 128-node GPU cluster with 1 Gbps Ethernet, where each node has 4 NVIDIA GTX 1080 GPUs and one PCI switch; (b) a 128-node GPU cluster with 100 Gbps InfiniBand network connections, where each node has 4 NVIDIA Tesla P100 GPUs connected via NVLink. (The figure plots FP32 and TernGrad training throughput for AlexNet, GoogLeNet, and VggNet-A.)
Software Dependencies | No | The experiments are performed by TensorFlow [2]. The paper mentions TensorFlow but does not provide a specific version number or any other software dependencies with version numbers.
Experiment Setup | Yes | For fair comparison, in each pair of comparative experiments using either floating or ternary gradients, all the other training hyper-parameters are the same unless differences are explicitly pointed out. In experiments, when SGD with momentum is adopted, a momentum value of 0.9 is used. When polynomial decay is applied to decay the learning rate (LR), a power of 0.5 is used to decay the LR from the base LR to zero. ... (1) decreasing the dropout ratio to keep more neurons; (2) using smaller weight decay; and (3) disabling ternarizing in the last classification layer. ... Specifically, we used a CNN [35] trained on CIFAR-10 by momentum SGD with a staircase learning rate and obtained the optimal c = 2.5. (Sketches of the ternarization, clipping, and LR-decay settings follow this table.)
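
The Algorithm 1 referenced in the Pseudocode row is a distributed SGD loop in which each worker quantizes its gradient to three levels before communicating it. As a rough illustration of that ternarization step only, and not the authors' reference implementation (which is the linked TensorFlow repository), a minimal NumPy sketch might look as follows; the function name ternarize_gradient and the use of NumPy are our own assumptions:

import numpy as np

def ternarize_gradient(grad, rng=None):
    """Stochastically quantize a gradient tensor to the three levels {-s, 0, +s}."""
    rng = np.random.default_rng() if rng is None else rng
    s = np.max(np.abs(grad))  # shared scaler: the largest gradient magnitude
    if s == 0.0:
        return np.zeros_like(grad)
    # Bernoulli mask: element k survives with probability |g_k| / s,
    # which makes the ternary tensor an unbiased estimate of grad.
    keep = rng.random(grad.shape) < np.abs(grad) / s
    return s * np.sign(grad) * keep.astype(grad.dtype)

Under this scheme a worker only needs to communicate the scaler s plus roughly two bits per gradient element (sign and whether it was kept), which is the source of the communication reduction the paper studies.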
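Two of the settings quoted in the Experiment Setup row, the gradient clipping with c = 2.5 and the polynomial learning-rate decay with power 0.5, can be expressed as small helpers. This is a hedged sketch under the assumption that the clipping bound is c times the standard deviation of the layer's gradient, as the paper describes; the helper names clip_gradient and polynomial_decay_lr are ours, not the authors':

import numpy as np

def clip_gradient(grad, c=2.5):
    """Clip gradient elements to [-c * sigma, +c * sigma], where sigma is the
    standard deviation of this layer's gradient. c = 2.5 is the value the
    paper reports tuning with a CNN on CIFAR-10."""
    bound = c * np.std(grad)
    return np.clip(grad, -bound, bound)

def polynomial_decay_lr(base_lr, step, total_steps, power=0.5):
    """Decay the learning rate from base_lr to zero with a polynomial of
    power 0.5, matching the schedule quoted in the experiment setup."""
    frac = min(step, total_steps) / float(total_steps)
    return base_lr * (1.0 - frac) ** power

As the paper motivates it, clipping is applied before ternarization so that a few outlier gradients do not inflate the scaler s, which reduces the variance introduced by the ternary quantization.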