Error Feedback Fixes SignSGD and other Gradient Compression Schemes
Authors: Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Stich, Martin Jaggi
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show simple convex counterexamples where signSGD does not converge to the optimum. Further, even when it does converge, signSGD may generalize poorly when compared with SGD. These issues arise because of the biased nature of the sign compression operator. We then show that using error-feedback, i.e. incorporating the error made by the compression operator into the next step, overcomes these issues. We prove that our algorithm (EF-SGD) with arbitrary compression operator achieves the same rate of convergence as SGD without any additional assumptions. Thus EF-SGD achieves gradient compression for free. Our experiments thoroughly substantiate the theory. |
| Researcher Affiliation | Academia | Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian U. Stich, Martin Jaggi (EPFL, Switzerland). Correspondence to: Sai Praneeth Karimireddy <sai.karimireddy@epfl.ch>. |
| Pseudocode | Yes | Algorithm 1 EF-SIGNSGD (SIGNSGD with Error-Feedback) ... Algorithm 2 EF-SGD (Compressed SGD with Error-Feedback); see the sketch after the table. |
| Open Source Code | Yes | Our code is available at github.com/epfml/error-feedback-SGD. |
| Open Datasets | Yes | All our experiments used the PyTorch framework (Paszke et al., 2017) on the CIFAR-10/100 dataset (Krizhevsky & Hinton, 2009). |
| Dataset Splits | No | The paper states the data is 'randomly split into test and train' but does not explicitly mention a separate validation split or its proportions. |
| Hardware Specification | No | The paper mentions training on 'massive computational resources' and 'today’s data centers' but does not provide specific hardware details like GPU/CPU models or cloud instance types used for experiments. |
| Software Dependencies | No | The paper mentions using the 'PyTorch framework (Paszke et al., 2017)' but does not provide a specific version number for PyTorch or other software dependencies. |
| Experiment Setup | Yes | All algorithms are run for 200 epochs. The learning rate is decimated (divided by 10) at 100 epochs and then again at 150 epochs. The initial learning rate is tuned manually (see Appendix A) for all algorithms using batch-size 128. For the smaller batch-sizes, the learning rate is proportionally reduced as suggested in (Goyal et al., 2017). The momentum parameter β (where applicable) was fixed to 0.9 and weight decay was left to the default value of 5 × 10⁻⁴. See the configuration sketch after the table. |
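
To make the Pseudocode row concrete, here is a minimal single-worker sketch of the EF-SIGNSGD update as Algorithm 1 describes it, written in PyTorch (the framework the authors report using). The function name `ef_signsgd_step` and the per-tensor error-buffer layout are our own illustrative choices, not the authors' released code; the reference implementation is at github.com/epfml/error-feedback-SGD.

```python
import torch

def ef_signsgd_step(param, grad, error, lr):
    """One EF-SIGNSGD update for a single tensor (sketch of Algorithm 1).

    param : torch.Tensor, model parameter (updated in place)
    grad  : torch.Tensor, stochastic gradient for `param`
    error : torch.Tensor, residual carried over from the previous step
    lr    : float, learning rate (gamma in the paper)
    Returns the new residual to feed back at the next step.
    """
    p = lr * grad + error                        # add back the accumulated compression error
    d = p.numel()
    delta = (p.abs().sum() / d) * torch.sign(p)  # scaled sign compression of the corrected step
    param.data.sub_(delta)                       # descent step with the compressed vector
    return p - delta                             # information discarded by the compressor
```

Each parameter tensor keeps its own `error` buffer, initialized to zeros and overwritten with the returned residual after every step. Feeding that residual back into the next update is the error-feedback mechanism that, per the paper, recovers the convergence rate of plain SGD.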
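
The Experiment Setup row likewise admits a hedged configuration sketch. The schedule below assumes a `MultiStepLR` scheduler with milestones at epochs 100 and 150 and a factor of 0.1, matching the reported 200-epoch runs with momentum 0.9, weight decay 5 × 10⁻⁴, and linear learning-rate scaling for batch sizes below 128. The model and base learning rate are placeholders, since the paper tunes the learning rate per algorithm (Appendix A).

```python
import torch
from torch.optim.lr_scheduler import MultiStepLR

# Placeholder model and base learning rate; the paper tunes base_lr per algorithm.
model = torch.nn.Linear(10, 10)
base_lr, base_batch = 0.1, 128
batch_size = 64
lr = base_lr * batch_size / base_batch  # linear scaling for smaller batches (Goyal et al., 2017)

optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=5e-4)
scheduler = MultiStepLR(optimizer, milestones=[100, 150], gamma=0.1)  # divide lr by 10 at 100 and 150

for epoch in range(200):
    # ... one pass over the training set, calling optimizer.step() per batch ...
    scheduler.step()
```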