Indirect Stochastic Gradient Quantization and Its Application in Distributed Deep Learning
Authors: Afshin Abdi, Faramarz Fekri (pp. 3113-3120)
AAAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we evaluate the properties of the developed ISGQ algorithms and their performance in distributed training. For the simulations, we consider MNIST database with fully-connected (784-1000-300-100-10) neural network (hereafter referred to as FC) and LeNet model (LeCun et al. 1998), CIFAR-10 database using CifarNet (Krizhevsky, Sutskever, and Hinton 2012), and ImageNet (Russakovsky et al. 2015) using AlexNet deep model (Krizhevsky, Sutskever, and Hinton 2012). The considered deep models, FC, LeNet, CifarNet and AlexNet have approximately 1.16, 1.66, 1.07 and 62.4 million parameters, respectively. In all our experiments, we use SGD or Adam algorithm with initial learning rate 0.01, decay rate 0.98 per epoch and batch-sizes 256 or 128 per worker. To evaluate the reduction in the transmission bits as well as the performance loss of the trained model, we compared our proposed method against the baseline distributed training without any quantization (i.e., 32 bits used for the transmissions of values) and other direct quantization methods: 1-bit quantization of (Seide et al. 2014), TernGrad (Wen et al. 2017) and QSGD (Alistarh et al. 2017). For implementation details and the distributed learning algorithm, please refer to the supplementary document. (A hedged sketch of one of these direct-quantization baselines appears after this table.) |
| Researcher Affiliation | Academia | Afshin Abdi, Faramarz Fekri, School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA; {abdi, fekri}@gatech.edu |
| Pseudocode | Yes | Algorithm 1, Empirical MSE-ISGQ: 1: Initialize α and β; 2: for few iterations do; 3: Fix α and solve (6) to update β; 4: Fix β and solve (6) to update α; 5: return quantizers for X and Δ. (A structural sketch of this alternating procedure appears after this table.) |
| Open Source Code | No | For implementation details and the distributed learning algorithm, please refer to the supplementary document. The paper does not explicitly state that code for the methodology is released, nor does it provide a direct link. |
| Open Datasets | Yes | For the simulations, we consider MNIST database with fully-connected (784-1000-300-100-10) neural network (hereafter referred to as FC) and LeNet model (LeCun et al. 1998), CIFAR-10 database using CifarNet (Krizhevsky, Sutskever, and Hinton 2012), and ImageNet (Russakovsky et al. 2015) using AlexNet deep model (Krizhevsky, Sutskever, and Hinton 2012). |
| Dataset Splits | No | No specific dataset split information (exact percentages, sample counts, or detailed splitting methodology for training, validation, and test sets) was found. While the paper mentions 'batch-sizes 256 or 128 per worker' and reports 'test accuracy', the actual splits used are not detailed. |
| Hardware Specification | Yes | First, we compare the required total processing and quantization times... using Intel Core i7 CPU and Nvidia Titan Xp GPU. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., 'Python 3.8, PyTorch 1.9, and CUDA 11.1') were explicitly stated in the paper. |
| Experiment Setup | Yes | In all our experiments, we use SGD or Adam algorithm with initial learning rate 0.01, decay rate 0.98 per epoch and batch-sizes 256 or 128 per worker. (A hedged configuration sketch of these hyperparameters follows this table.) |
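
The quoted setup (SGD or Adam, initial learning rate 0.01, decay rate 0.98 per epoch, batch size 128 or 256 per worker) amounts to a small piece of framework configuration. The paper does not state which framework was used, so the PyTorch snippet below is only an assumed, minimal sketch of a single worker's optimizer and learning-rate schedule; the FC model follows the 784-1000-300-100-10 architecture quoted above.

```python
# Minimal per-worker training configuration matching the quoted hyperparameters.
# PyTorch is an assumption here; the paper does not specify a framework.
import torch
import torch.nn as nn

# Fully-connected 784-1000-300-100-10 model ("FC" in the paper).
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 1000), nn.ReLU(),
    nn.Linear(1000, 300), nn.ReLU(),
    nn.Linear(300, 100), nn.ReLU(),
    nn.Linear(100, 10),
)

# SGD with initial learning rate 0.01 (Adam is the stated alternative).
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Multiply the learning rate by 0.98 once per epoch.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.98)

batch_size_per_worker = 256  # 128 in some experiments

# Per epoch: run mini-batch forward/backward passes, quantize and exchange
# gradients between workers, call optimizer.step(), then scheduler.step().
```

The 0.98-per-epoch decay is reproduced by calling `scheduler.step()` once at the end of each epoch rather than after every mini-batch.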
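
The Pseudocode row above describes Algorithm 1 (Empirical MSE-ISGQ) as an alternating procedure: hold α fixed and solve the paper's objective (6) for β, then hold β fixed and solve (6) for α, for a few iterations. Objective (6) itself is not reproduced in this summary, so the sketch below only captures that alternating structure; `solve_obj6_for_beta` and `solve_obj6_for_alpha` are hypothetical callables standing in for the per-variable minimizations.

```python
# Structural sketch of Algorithm 1 (Empirical MSE-ISGQ) as quoted above.
# The two solver callables are hypothetical stand-ins for minimizing the
# paper's objective (6) with the other variable held fixed.

def empirical_mse_isgq(X, Delta, alpha0, beta0,
                       solve_obj6_for_beta, solve_obj6_for_alpha,
                       num_iters=5):
    """Alternately update the quantizer parameters (alpha, beta)."""
    alpha, beta = alpha0, beta0
    for _ in range(num_iters):                        # "for few iterations do"
        beta = solve_obj6_for_beta(X, Delta, alpha)   # fix alpha, solve (6) for beta
        alpha = solve_obj6_for_alpha(X, Delta, beta)  # fix beta, solve (6) for alpha
    return alpha, beta                                # quantizers for X and Delta
```

Stopping after a small, fixed number of passes mirrors the "for few iterations do" loop in the quoted pseudocode.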
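
Among the direct-quantization baselines listed in the table, TernGrad (Wen et al. 2017) is the simplest to illustrate: each gradient entry is stochastically mapped to one of three levels {-s, 0, +s}, where s is the largest gradient magnitude. The sketch below is a simplified, per-tensor version that omits refinements from the original paper such as gradient clipping and layer-wise scalers.

```python
# Simplified TernGrad-style stochastic ternarization (Wen et al. 2017).
# Per-tensor scaling only; gradient clipping and layer-wise handling are omitted.
import numpy as np

def ternarize(grad, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    s = np.max(np.abs(grad))              # scaler: largest magnitude in the tensor
    if s == 0.0:
        return np.zeros_like(grad)        # an all-zero gradient quantizes to zero
    keep_prob = np.abs(grad) / s          # larger entries survive more often
    mask = rng.random(grad.shape) < keep_prob
    return s * np.sign(grad) * mask       # each entry ends up in {-s, 0, +s}
```

By construction the ternarized gradient is unbiased in expectation, since E[mask] = |g|/s, which is why such quantizers can be dropped into SGD without changing the expected update direction.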