AdaComp: Adaptive Residual Gradient Compression for Data-Parallel Distributed Training
Authors: Chia-Yu Chen, Jungwook Choi, Daniel Brand, Ankur Agrawal, Wei Zhang, Kailash Gopalakrishnan
AAAI 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We performed a suite of experiments using the AdaComp algorithm. In this paper, we evaluate convergence and compression (items 3 and 4) but do not report the impact on runtime (items 1 and 2). Table 2: CNN, MLP, and LSTM results. Figure 2: Model convergence results for different networks, datasets and learner numbers. |
| Researcher Affiliation | Industry | Chia-Yu Chen, Jungwook Choi, Daniel Brand, Ankur Agrawal, Wei Zhang, Kailash Gopalakrishnan IBM Research AI 1101 Kitchawan Rd. Yorktown Heights, New York 10598 {cchen, choij, danbrand, ankuragr, weiz, kailash}@us.ibm.com |
| Pseudocode | Yes | The following pseudocode describes two algorithms. Algorithm 1 shows the gradient-weight communication scheme we used to test AdaComp, and Algorithm 2 is the AdaComp algorithm we propose. |
| Open Source Code | No | The paper does not contain any explicit statement about providing open-source code for their methodology or a link to a code repository. |
| Open Datasets | Yes | We show excellent results on a wide spectrum of state-of-the-art Deep Learning models in multiple domains (vision, speech, language), datasets (MNIST, CIFAR10, ImageNet, BN50, Shakespeare), optimizers (SGD with momentum, Adam) and network parameters (number of learners, minibatch-size etc.). Table 1 records the details of the datasets and neural network models we use in this paper. |
| Dataset Splits | No | The paper discusses training and testing, and reports test errors, but does not specify explicit percentages or sample counts for training, validation, and test splits. |
| Hardware Specification | Yes | Experiments were done using IBM SoftLayer cloud servers where each server node is equipped with two Intel Xeon E5-2690-V3 processors and two NVIDIA Tesla K80 cards. Each Xeon processor has 12 cores running at 2.66GHz and each Tesla K80 card contains two K40 GPUs each with 12GB of GDDR5 memory. |
| Software Dependencies | No | The paper mentions "The software platform is an in-house distributed deep learning framework ((Gupta, Zhang, and Wang 2016), (Nair and Gupta 2017)). The exchange of gradients is done in a peer-to-peer fashion using MPI." However, no specific version numbers are given for these or other software dependencies, which are required for reproducibility. |
| Experiment Setup | Yes | In all these experiments we used the same hyper-parameters as the baseline (i.e., no compression). The selection of LT is empirical and is a balance between communication time and model accuracy; the same values are used across all models: LT is set to 50 for convolutional layers and to 500 for FC and LSTM layers. Table 2 provides "Mini-Batch size" and "Epochs" for various models. |
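
To make the compression scheme referenced in the Pseudocode row concrete, below is a minimal sketch of an AdaComp-style adaptive residual compression step. It is not the paper's exact Algorithm 2: the bin-wise local-maximum selection and the residue handling follow the paper's description, but the function name, the NumPy implementation, and the tie-breaking at the threshold are assumptions; the bin sizes (LT = 50 for convolutional layers, 500 for FC and LSTM layers) are the values quoted in the Experiment Setup row.

```python
import numpy as np

def adacomp_compress(grad, residue, bin_size=50):
    """Sketch of one AdaComp-style compression step for a flattened layer gradient.

    grad     : latest local gradient for the layer (1-D array)
    residue  : locally accumulated, not-yet-sent gradient (same shape as grad)
    bin_size : LT from the paper (50 for conv layers, 500 for FC/LSTM layers)

    Returns (indices, values, new_residue), where (indices, values) is the sparse
    payload to exchange and new_residue holds the unsent remainder.
    """
    g = residue + grad   # accumulated gradient
    h = g + grad         # over-weight the newest gradient (the "self-adjusting" part)

    n = g.size
    send_mask = np.zeros(n, dtype=bool)
    for start in range(0, n, bin_size):
        end = min(start + bin_size, n)
        local_max = np.abs(g[start:end]).max()
        # send every component whose boosted magnitude reaches the bin's local maximum
        send_mask[start:end] = np.abs(h[start:end]) >= local_max

    indices = np.nonzero(send_mask)[0]
    values = g[indices]
    new_residue = np.where(send_mask, 0.0, g)  # keep what was not sent for next iteration
    return indices, values, new_residue
```

In use, each learner would call such a routine per layer on the flattened gradient, exchange the sparse (indices, values) pairs, and carry new_residue forward to the next mini-batch.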
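
The Software Dependencies row notes only that gradients are exchanged peer-to-peer with MPI through an in-house framework. Purely as an illustration, here is a minimal sketch of how a sparse payload from a compression step like the one above could be exchanged and accumulated; the choice of mpi4py, the pickle-based allgather, the dense accumulation, and the averaging are all assumptions, not details from the paper.

```python
from mpi4py import MPI
import numpy as np

def exchange_sparse_gradients(indices, values, num_params):
    """Sketch: allgather each learner's sparse (indices, values) payload and
    accumulate the contributions into a dense gradient buffer."""
    comm = MPI.COMM_WORLD

    # Object-based allgather of small Python tuples (fine for a sketch;
    # a production system would use buffer-based collectives).
    payloads = comm.allgather((indices, values))

    aggregated = np.zeros(num_params, dtype=np.float64)
    for idx, val in payloads:
        np.add.at(aggregated, idx, val)   # scatter-add each learner's sparse update
    return aggregated / comm.Get_size()   # average across learners
```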