signSGD: Compressed Optimisation for Non-Convex Problems
Authors: Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, Animashree Anandkumar
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On the practical side we find that the momentum counterpart of SIGNSGD is able to match the accuracy and convergence speed of ADAM on deep Imagenet models. We extend our theory to the distributed setting, where the parameter server uses majority vote to aggregate gradient signs from each worker enabling 1-bit compression of worker-server communication in both directions. Using a theorem by Gauss (1823) we prove that majority vote can achieve the same reduction in variance as full precision distributed SGD. Thus, there is great promise for sign-based optimisation schemes to achieve fast communication and fast convergence. Code to reproduce experiments is to be found at https://github.com/jxbz/signSGD. |
| Researcher Affiliation | Collaboration | 1Caltech 2Amazon AI 3UC Santa Barbara 4UC Irvine. Correspondence to: Jeremy Bernstein <bernstein@caltech.edu>, Yu-Xiang Wang <yuxiangw@amazon.com>. |
| Pseudocode | Yes | Algorithm 1 SIGNSGD; Algorithm 2 SIGNUM; Algorithm 3 Distributed training by majority vote |
| Open Source Code | Yes | Code to reproduce experiments is to be found at https://github.com/jxbz/signSGD. |
| Open Datasets | Yes | throughout the paper we will make use of the CIFAR-10 (Krizhevsky, 2009) and Imagenet (Russakovsky et al., 2015) datasets. |
| Dataset Splits | Yes | Initial learning rate and weight decay were tuned on a separate validation set split off from the training set and all other hyperparameters were chosen to be those found favourable for SGD by the community. |
| Hardware Specification | No | The paper mentions "GPUs linked within a single machine" but does not specify any particular models or detailed hardware specifications used for the experiments. |
| Software Dependencies | No | The paper mentions open-source implementations and refers indirectly to software such as PyTorch and Caffe, but it does not specify version numbers for any key software components or libraries. |
| Experiment Setup | Yes | Top row: Resnet-20 architecture trained to epoch 50 on CIFAR-10 with a batch size of 128. Bottom row: Resnet-50 architecture trained to epoch 50 on Imagenet with a batch size of 256. Initial learning rate and weight decay were tuned on a separate validation set split off from the training set and all other hyperparameters were chosen to be those found favourable for SGD by the community. Note that for β = 0.9, we have C = 54 which is negligible. |
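
The Pseudocode row above cites Algorithm 1 (SIGNSGD), Algorithm 2 (SIGNUM) and Algorithm 3 (distributed training by majority vote). A minimal NumPy sketch of the first two update rules follows; the function names and the toy quadratic objective are illustrative and not taken from the paper's repository.

```python
# Minimal sketch of the signSGD (Algorithm 1) and Signum (Algorithm 2) updates.
# Function names and the toy objective are illustrative assumptions.
import numpy as np

def sign_sgd_step(x, grad, lr):
    """signSGD: step along the elementwise sign of the stochastic gradient."""
    return x - lr * np.sign(grad)

def signum_step(x, grad, momentum, lr, beta=0.9):
    """Signum: apply the sign to a momentum-smoothed gradient instead."""
    momentum = beta * momentum + (1.0 - beta) * grad
    return x - lr * np.sign(momentum), momentum

# Toy usage on f(x) = 0.5 * ||x||^2, whose gradient is x.
x = np.array([5.0, -3.0, 1.5])
m = np.zeros_like(x)
for step in range(100):
    grad = x  # a stochastic gradient would be used in practice
    x, m = signum_step(x, grad, m, lr=0.05)
```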
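
The abstract quoted in the Research Type row also describes distributed training by majority vote, with 1-bit compression of worker-server communication in both directions. The sketch below simulates that aggregation step in NumPy; the worker-noise model and hyperparameters are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of distributed training by majority vote (Algorithm 3):
# each worker sends only the sign of its stochastic gradient, the server
# aggregates by elementwise majority vote, and broadcasts a single sign
# vector back. Worker gradients are simulated noisy copies of a true gradient.
import numpy as np

def majority_vote_step(x, worker_grads, lr):
    # Each worker transmits sign(g_m): 1 bit per coordinate.
    worker_signs = [np.sign(g) for g in worker_grads]
    # The parameter server takes the sign of the sum of signs (majority vote).
    vote = np.sign(np.sum(worker_signs, axis=0))
    return x - lr * vote

# Toy usage: M = 7 workers observe noisy gradients of f(x) = 0.5 * ||x||^2.
rng = np.random.default_rng(0)
x = np.array([5.0, -3.0, 1.5])
for step in range(200):
    true_grad = x
    grads = [true_grad + rng.normal(scale=1.0, size=x.shape) for _ in range(7)]
    x = majority_vote_step(x, grads, lr=0.02)
```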
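
The Dataset Splits and Experiment Setup rows state that the initial learning rate and weight decay were tuned on a validation set split off from the training set. The helper below sketches one way such a protocol could look; the split fraction, candidate grid and the `train_and_eval` callback are hypothetical and not specified in the paper.

```python
# Hypothetical sketch of tuning the initial learning rate and weight decay
# on a validation split held out from the training set.
import numpy as np

def tune_on_validation(train_set, train_and_eval, val_fraction=0.1, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(train_set))
    n_val = int(val_fraction * len(train_set))
    val_idx, tr_idx = idx[:n_val], idx[n_val:]
    best = None
    for lr in (1e-4, 1e-3, 1e-2, 1e-1):        # assumed candidate grid
        for wd in (0.0, 1e-5, 1e-4):           # assumed candidate grid
            val_acc = train_and_eval(tr_idx, val_idx, lr=lr, weight_decay=wd)
            if best is None or val_acc > best[0]:
                best = (val_acc, lr, wd)
    return best  # (validation accuracy, chosen lr, chosen weight decay)
```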