TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning

Authors: Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, Hai Li

NeurIPS 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that applying TernGrad on AlexNet doesn't incur any accuracy loss and can even improve accuracy. The accuracy loss of GoogLeNet induced by TernGrad is less than 2% on average. Finally, a performance model is proposed to study the scalability of TernGrad. Experiments show significant speed gains for various deep neural networks.
Researcher Affiliation | Collaboration | ¹Duke University, ²Hewlett Packard Labs, ³University of Nevada, Reno, ⁴University of Pittsburgh
Pseudocode | Yes | Algorithm 1 TernGrad: distributed SGD training using ternary gradients. (A sketch of the ternarization step appears after this table.)
Open Source Code | Yes | Our source code is available: https://github.com/wenwei202/terngrad
Open Datasets | Yes | We study the convergence of TernGrad using LeNet on MNIST and a ConvNet [35] (named as CifarNet) on CIFAR-10. ... We also evaluate TernGrad by AlexNet and GoogLeNet trained on ImageNet.
Dataset Splits | No | The paper mentions "Validation accuracy is evaluated using only the central crops of images" but does not provide specific training/validation/test split percentages or sample counts.
Hardware Specification | Yes | Figure 5: Training throughput on two different GPU clusters: (a) a 128-node GPU cluster with 1 Gbps Ethernet, where each node has 4 NVIDIA GTX 1080 GPUs and one PCI switch; (b) a 128-node GPU cluster with 100 Gbps InfiniBand network connections, where each node has 4 NVIDIA Tesla P100 GPUs connected via NVLink. (The figure plots FP32 and TernGrad training throughput for AlexNet, GoogLeNet, and VggNet-A.)
Software Dependencies | No | The experiments are performed by TensorFlow [2]. The paper mentions TensorFlow but does not provide a specific version number or any other software dependencies with version numbers.
Experiment Setup | Yes | For fair comparison, in each pair of comparative experiments using either floating or ternary gradients, all the other training hyper-parameters are the same unless differences are explicitly pointed out. In experiments, when SGD with momentum is adopted, a momentum value of 0.9 is used. When polynomial decay is applied to decay the learning rate (LR), a power of 0.5 is used to decay the LR from the base LR to zero. ... (1) decreasing the dropout ratio to keep more neurons; (2) using smaller weight decay; and (3) disabling ternarizing in the last classification layer. ... Specifically, we used a CNN [35] trained on CIFAR-10 by momentum SGD with a staircase learning rate and obtained the optimal c = 2.5. (Sketches of the ternarization, clipping, and LR-decay settings follow this table.)
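
The Algorithm 1 referenced in the Pseudocode row is a distributed SGD loop in which each worker quantizes its gradient to three levels before communicating it. As a rough illustration of that ternarization step only, and not the authors' reference implementation (which is the linked TensorFlow repository), a minimal NumPy sketch might look as follows; the function name ternarize_gradient and the use of NumPy are our own assumptions:

import numpy as np

def ternarize_gradient(grad, rng=None):
    """Stochastically quantize a gradient tensor to the three levels {-s, 0, +s}."""
    rng = np.random.default_rng() if rng is None else rng
    s = np.max(np.abs(grad))  # shared scaler: the largest gradient magnitude
    if s == 0.0:
        return np.zeros_like(grad)
    # Bernoulli mask: element k survives with probability |g_k| / s,
    # which makes the ternary tensor an unbiased estimate of grad.
    keep = rng.random(grad.shape) < np.abs(grad) / s
    return s * np.sign(grad) * keep.astype(grad.dtype)

Under this scheme a worker only needs to communicate the scaler s plus roughly two bits per gradient element (sign and whether it was kept), which is the source of the communication reduction the paper studies.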
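Two of the settings quoted in the Experiment Setup row, the gradient clipping with c = 2.5 and the polynomial learning-rate decay with power 0.5, can be expressed as small helpers. This is a hedged sketch under the assumption that the clipping bound is c times the standard deviation of the layer's gradient, as the paper describes; the helper names clip_gradient and polynomial_decay_lr are ours, not the authors':

import numpy as np

def clip_gradient(grad, c=2.5):
    """Clip gradient elements to [-c * sigma, +c * sigma], where sigma is the
    standard deviation of this layer's gradient. c = 2.5 is the value the
    paper reports tuning with a CNN on CIFAR-10."""
    bound = c * np.std(grad)
    return np.clip(grad, -bound, bound)

def polynomial_decay_lr(base_lr, step, total_steps, power=0.5):
    """Decay the learning rate from base_lr to zero with a polynomial of
    power 0.5, matching the schedule quoted in the experiment setup."""
    frac = min(step, total_steps) / float(total_steps)
    return base_lr * (1.0 - frac) ** power

As the paper motivates it, clipping is applied before ternarization so that a few outlier gradients do not inflate the scaler s, which reduces the variance introduced by the ternary quantization.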