Error Feedback Fixes SignSGD and other Gradient Compression Schemes
Authors: Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Stich, Martin Jaggi
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show simple convex counterexamples where signSGD does not converge to the optimum. Further, even when it does converge, signSGD may generalize poorly when compared with SGD. These issues arise because of the biased nature of the sign compression operator. We then show that using error-feedback, i.e. incorporating the error made by the compression operator into the next step, overcomes these issues. We prove that our algorithm (EF-SGD) with arbitrary compression operator achieves the same rate of convergence as SGD without any additional assumptions. Thus EF-SGD achieves gradient compression for free. Our experiments thoroughly substantiate the theory. |
| Researcher Affiliation | Academia | Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian U. Stich, Martin Jaggi (EPFL, Switzerland). Correspondence to: Sai Praneeth Karimireddy <sai.karimireddy@epfl.ch>. |
| Pseudocode | Yes | Algorithm 1 EF-SIGNSGD (SIGNSGD with Error-Feedback) ... Algorithm 2 EF-SGD (Compressed SGD with Error-Feedback); see the sketch after the table. |
| Open Source Code | Yes | Our code is available at github.com/epfml/error-feedback-SGD. |
| Open Datasets | Yes | All our experiments used the PyTorch framework (Paszke et al., 2017) on the CIFAR-10/100 dataset (Krizhevsky & Hinton, 2009). |
| Dataset Splits | No | The paper states the data is 'randomly split into test and train' but does not explicitly mention a separate validation split or its proportions. |
| Hardware Specification | No | The paper mentions training on 'massive computational resources' and 'today’s data centers' but does not provide specific hardware details like GPU/CPU models or cloud instance types used for experiments. |
| Software Dependencies | No | The paper mentions using the 'PyTorch framework (Paszke et al., 2017)' but does not provide a specific version number for PyTorch or other software dependencies. |
| Experiment Setup | Yes | All algorithms are run for 200 epochs. The learning rate is decimated (divided by 10) at 100 epochs and then again at 150 epochs. The initial learning rate is tuned manually (see Appendix A) for all algorithms using batch-size 128. For the smaller batch-sizes, the learning rate is proportionally reduced as suggested in (Goyal et al., 2017). The momentum parameter β (where applicable) was fixed to 0.9 and weight decay was left to the default value of 5 × 10⁻⁴. See the configuration sketch after the table. |
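
To make the Pseudocode row concrete, here is a minimal single-worker sketch of the EF-SIGNSGD update as Algorithm 1 describes it, written in PyTorch (the framework the authors report using). The function name `ef_signsgd_step` and the per-tensor error-buffer layout are our own illustrative choices, not the authors' released code; the reference implementation is at github.com/epfml/error-feedback-SGD.

```python
import torch

def ef_signsgd_step(param, grad, error, lr):
    """One EF-SIGNSGD update for a single tensor (sketch of Algorithm 1).

    param : torch.Tensor, model parameter (updated in place)
    grad  : torch.Tensor, stochastic gradient for `param`
    error : torch.Tensor, residual carried over from the previous step
    lr    : float, learning rate (gamma in the paper)
    Returns the new residual to feed back at the next step.
    """
    p = lr * grad + error                        # add back the accumulated compression error
    d = p.numel()
    delta = (p.abs().sum() / d) * torch.sign(p)  # scaled sign compression of the corrected step
    param.data.sub_(delta)                       # descent step with the compressed vector
    return p - delta                             # information discarded by the compressor
```

Each parameter tensor keeps its own `error` buffer, initialized to zeros and overwritten with the returned residual after every step. Feeding that residual back into the next update is the error-feedback mechanism that, per the paper, recovers the convergence rate of plain SGD.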
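
The Experiment Setup row likewise admits a hedged configuration sketch. The schedule below assumes a `MultiStepLR` scheduler with milestones at epochs 100 and 150 and a factor of 0.1, matching the reported 200-epoch runs with momentum 0.9, weight decay 5 × 10⁻⁴, and linear learning-rate scaling for batch sizes below 128. The model and base learning rate are placeholders, since the paper tunes the learning rate per algorithm (Appendix A).

```python
import torch
from torch.optim.lr_scheduler import MultiStepLR

# Placeholder model and base learning rate; the paper tunes base_lr per algorithm.
model = torch.nn.Linear(10, 10)
base_lr, base_batch = 0.1, 128
batch_size = 64
lr = base_lr * batch_size / base_batch  # linear scaling for smaller batches (Goyal et al., 2017)

optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=5e-4)
scheduler = MultiStepLR(optimizer, milestones=[100, 150], gamma=0.1)  # divide lr by 10 at 100 and 150

for epoch in range(200):
    # ... one pass over the training set, calling optimizer.step() per batch ...
    scheduler.step()
```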