Step-Ahead Error Feedback for Distributed Training with Compressed Gradient
Authors: An Xu, Zhouyuan Huo, Heng Huang (pp. 10478–10486)
AAAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Both our theoretical and empirical results show that our new methods can handle the gradient mismatch problem. The experimental results show that we can even train faster with common gradient compression schemes than both the full-precision training and local error feedback regarding the training epochs and without performance loss. |
| Researcher Affiliation | Collaboration | 1 Electrical and Computer Engineering Department, University of Pittsburgh, PA, USA 2 Google, Mountain View, CA, USA 3 JD Finance America Corporation, Mountain View, CA, USA {an.xu, heng.huang}@pitt.edu, zhouyuan.huo@gmail.com |
| Pseudocode | Yes | Algorithm 1: Distributed Momentum SGD with Double Way Compression. (An illustrative error-feedback sketch is given after the table.) |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | Yes | We train the ResNet-56 (He et al. 2016) model with multiple workers (GPUs) on CIFAR-100 (Krizhevsky, Hinton et al. 2009) image classification task. |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, or detailed splitting methodology) for training, validation, and test sets. |
| Hardware Specification | No | The paper mentions "multiple workers (GPUs)" but does not provide specific hardware details like exact GPU/CPU models, processor types, or memory amounts used for running its experiments. |
| Software Dependencies | No | The paper states that "All experiments are implemented with PyTorch (Paszke et al. 2019)" but does not specify version numbers or a complete list of software dependencies. |
| Experiment Setup | Yes | The base learning rate is 0.1 and the total batch size is 128. The momentum constant is 0.9 and the weight decay is 5×10⁻⁴. For momentum SGD the model is trained for 200 epochs with a learning rate decay of 0.1 at epochs 100 and 150. (A configuration sketch follows the table.) |
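
The Pseudocode row cites Algorithm 1 (Distributed Momentum SGD with Double Way Compression). The paper's step-ahead variant is not reproduced here; as a point of reference, the following is a minimal single-worker sketch of the classic local error-feedback baseline discussed in the paper, using a top-k compressor and the hyperparameters reported in the setup row. The helper names (`topk_compress`, `ErrorFeedbackSGD`) are hypothetical and this is not the authors' implementation.

```python
import torch


def topk_compress(tensor, ratio=0.01):
    """Keep only the largest-magnitude entries; zero the rest (a common sparsifying compressor)."""
    flat = tensor.flatten()
    k = max(1, int(flat.numel() * ratio))
    _, idx = torch.topk(flat.abs(), k)
    mask = torch.zeros_like(flat)
    mask[idx] = 1.0
    return (flat * mask).view_as(tensor)


class ErrorFeedbackSGD:
    """Single-worker sketch of classic (local) error feedback: the residual left over
    after compression is stored and added back to the next gradient before compressing."""

    def __init__(self, params, lr=0.1, momentum=0.9, weight_decay=5e-4):
        self.params = list(params)
        self.lr, self.momentum, self.wd = lr, momentum, weight_decay
        self.errors = [torch.zeros_like(p) for p in self.params]
        self.buffers = [torch.zeros_like(p) for p in self.params]

    @torch.no_grad()
    def step(self):
        for p, e, buf in zip(self.params, self.errors, self.buffers):
            if p.grad is None:
                continue
            g = p.grad + self.wd * p                   # gradient with weight decay
            corrected = g + e                          # add back the previous compression error
            compressed = topk_compress(corrected)      # what a worker would actually transmit
            e.copy_(corrected - compressed)            # store the new residual for the next step
            buf.mul_(self.momentum).add_(compressed)   # momentum on the compressed gradient
            p.add_(buf, alpha=-self.lr)                # parameter update
```

In a real distributed run each worker would communicate `compressed` (and, for double way compression, the server would compress the aggregated update before broadcasting), which this single-worker sketch omits.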
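
For the Experiment Setup row, a minimal PyTorch sketch of the reported hyperparameters (base learning rate 0.1, momentum 0.9, weight decay 5×10⁻⁴, 200 epochs with a 0.1 decay at epochs 100 and 150, total batch size 128) might look as follows. The model, data loading, and multi-GPU distribution are placeholders, not the authors' actual code.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR

# Placeholder model standing in for ResNet-56 on CIFAR-100 (not the authors' code).
model = torch.nn.Linear(32 * 32 * 3, 100)

# Reported hyperparameters: base LR 0.1, momentum 0.9, weight decay 5e-4,
# 200 epochs, LR multiplied by 0.1 at epochs 100 and 150, total batch size 128.
optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
scheduler = MultiStepLR(optimizer, milestones=[100, 150], gamma=0.1)

for epoch in range(200):
    # ... iterate over CIFAR-100 mini-batches (total batch size 128),
    #     compute the loss, call loss.backward(), and step the optimizer ...
    scheduler.step()  # apply the 0.1 learning-rate decay at epochs 100 and 150
```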