Indirect Stochastic Gradient Quantization and Its Application in Distributed Deep Learning

Authors: Afshin Abdi, Faramarz Fekri (pp. 3113-3120)

Venue: AAAI 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this section, we evaluate the properties of the developed ISGQ algorithms and their performance in distributed training. For the simulations, we consider MNIST database with fully-connected (784-1000-300-100-10) neural network (hereafter referred to as FC) and LeNet model (LeCun et al. 1998), CIFAR-10 database using CifarNet (Krizhevsky, Sutskever, and Hinton 2012), and ImageNet (Russakovsky et al. 2015) using AlexNet deep model (Krizhevsky, Sutskever, and Hinton 2012). The considered deep models, FC, LeNet, CifarNet and AlexNet have approximately 1.16, 1.66, 1.07 and 62.4 million parameters, respectively. In all our experiments, we use SGD or Adam algorithm with initial learning rate 0.01, decay rate 0.98 per epoch and batch-sizes 256 or 128 per worker. To evaluate the reduction in the transmission bits as well as the performance loss of the trained model, we compared our proposed method against the baseline distributed training without any quantization (i.e., 32 bits used for the transmissions of values) and other direct quantization methods: 1-bit quantization of (Seide et al. 2014), TernGrad (Wen et al. 2017) and QSGD (Alistarh et al. 2017). For implementation details and the distributed learning algorithm, please refer to the supplementary document." (An illustrative sketch of one such direct quantization baseline is given after the table.)
Researcher Affiliation | Academia | Afshin Abdi, Faramarz Fekri; School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA; {abdi, fekri}@gatech.edu
Pseudocode | Yes | Algorithm 1 (Empirical MSE-ISGQ): 1: Initialize α and β. 2: for few iterations do 3: Fix α and solve (6) to update β. 4: Fix β and solve (6) to update α. 5: return quantizers for X and Δ. (A hedged runnable sketch of this alternating loop is given after the table.)
Open Source Code | No | "For implementation details and the distributed learning algorithm, please refer to the supplementary document." The paper points readers to a supplementary document but does not explicitly state that code for the methodology is released, nor does it provide a direct link.
Open Datasets | Yes | "For the simulations, we consider MNIST database with fully-connected (784-1000-300-100-10) neural network (hereafter referred to as FC) and LeNet model (LeCun et al. 1998), CIFAR-10 database using CifarNet (Krizhevsky, Sutskever, and Hinton 2012), and ImageNet (Russakovsky et al. 2015) using AlexNet deep model (Krizhevsky, Sutskever, and Hinton 2012)."
Dataset Splits | No | No specific dataset split information (exact percentages, sample counts, or detailed splitting methodology for training, validation, and test sets) was found. While the paper mentions 'batch-sizes 256 or 128 per worker' and reports 'test accuracy', the actual splits used are not detailed.
Hardware Specification | Yes | "First, we compare the required total processing and quantization times... using Intel Core i7 CPU and Nvidia Titan Xp GPU."
Software Dependencies | No | No specific software dependencies with version numbers (e.g., 'Python 3.8, PyTorch 1.9, and CUDA 11.1') were explicitly stated in the paper.
Experiment Setup | Yes | "In all our experiments, we use SGD or Adam algorithm with initial learning rate 0.01, decay rate 0.98 per epoch and batch-sizes 256 or 128 per worker." (A hedged configuration sketch of these settings is given below.)
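
The Algorithm 1 excerpt in the Pseudocode row describes an alternating optimization of the two quantizer parameters α and β. Below is a minimal Python sketch of that loop. The paper's actual objective, Eq. (6), is not reproduced in this excerpt, so `empirical_mse` here is only a stand-in that scores a simple uniform quantizer with scales `alpha` (for X) and `beta` (for Δ); the function names, grid search, and bit-width are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def uniform_quantize(v, scale, bits=2):
    """Uniformly quantize v with step size `scale` and a `bits`-bit budget."""
    levels = 2 ** (bits - 1)
    q = np.clip(np.round(v / scale), -levels, levels - 1)
    return q * scale

def empirical_mse(X, Delta, alpha, beta, bits=2):
    """Stand-in for Eq. (6): reconstruction MSE when X and Delta are quantized separately."""
    recon = uniform_quantize(X, alpha, bits) + uniform_quantize(Delta, beta, bits)
    return np.mean((X + Delta - recon) ** 2)

def empirical_mse_isgq(X, Delta, iters=5, grid=np.logspace(-3, 0, 50)):
    """Alternating minimization over the two quantizer scales."""
    alpha, beta = np.std(X), np.std(Delta)          # 1: initialize alpha and beta
    for _ in range(iters):                          # 2: for few iterations do
        beta = min(grid, key=lambda b: empirical_mse(X, Delta, alpha, b))   # 3: fix alpha, update beta
        alpha = min(grid, key=lambda a: empirical_mse(X, Delta, a, beta))   # 4: fix beta, update alpha
    return alpha, beta                              # 5: return the two quantizer scales

# Example usage on random stand-ins for the weights X and the update Delta:
rng = np.random.default_rng(0)
alpha, beta = empirical_mse_isgq(rng.normal(size=10_000), 0.01 * rng.normal(size=10_000))
```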
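
The Experiment Setup row reports only optimizer-level hyperparameters (SGD or Adam, initial learning rate 0.01, a 0.98 decay per epoch, and per-worker batch sizes of 128 or 256). The excerpt does not name the framework, so the following single-worker sketch assumes PyTorch purely for illustration; the model, data, and epoch count are placeholders, and the quantized communication step is only indicated by a comment.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data standing in for one worker's shard (not MNIST itself).
data = TensorDataset(torch.randn(1024, 784), torch.randint(0, 10, (1024,)))
train_loader = DataLoader(data, batch_size=128)        # 128 or 256 per worker

model = torch.nn.Linear(784, 10)                       # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # or torch.optim.Adam(..., lr=0.01)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.98)

for epoch in range(10):                                # epoch count not stated in the excerpt
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()
        # In the distributed setting each worker would quantize and exchange
        # its update here (ISGQ or one of the direct baselines) before aggregation.
        optimizer.step()
    scheduler.step()                                   # learning rate multiplied by 0.98 each epoch
```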
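
For contrast with the paper's indirect approach, the baselines listed in the Research Type row quantize stochastic gradients directly. The sketch below shows the general flavor of such a direct scheme, loosely following TernGrad-style stochastic ternarization (Wen et al. 2017) while omitting details such as layer-wise scaling and gradient clipping; it is an illustration written from general knowledge, not code from the paper, and it is not the authors' ISGQ method.

```python
import numpy as np

def ternarize(grad, rng=None):
    """Stochastically map each gradient entry to {-s, 0, +s} with s = max|grad|,
    so the quantized gradient stays unbiased in expectation."""
    rng = rng or np.random.default_rng()
    s = np.max(np.abs(grad))
    if s == 0.0:
        return np.zeros_like(grad)
    keep = rng.random(grad.shape) < np.abs(grad) / s   # P(keep) = |g_i| / s
    return s * np.sign(grad) * keep

# Each worker would transmit only s, the signs, and the sparsity mask instead
# of 32-bit floats; the server then averages the ternarized gradients.
g = np.random.default_rng(1).normal(size=8)
print(ternarize(g))
```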