signSGD: Compressed Optimisation for Non-Convex Problems
Authors: Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, Animashree Anandkumar
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On the practical side we find that the momentum counterpart of SIGNSGD is able to match the accuracy and convergence speed of ADAM on deep Imagenet models. We extend our theory to the distributed setting, where the parameter server uses majority vote to aggregate gradient signs from each worker enabling 1-bit compression of worker-server communication in both directions. Using a theorem by Gauss (1823) we prove that majority vote can achieve the same reduction in variance as full precision distributed SGD. Thus, there is great promise for sign-based optimisation schemes to achieve fast communication and fast convergence. Code to reproduce experiments is to be found at https://github.com/jxbz/signSGD. |
| Researcher Affiliation | Collaboration | 1Caltech 2Amazon AI 3UC Santa Barbara 4UC Irvine. Correspondence to: Jeremy Bernstein <bernstein@caltech.edu>, Yu-Xiang Wang <yuxiangw@amazon.com>. |
| Pseudocode | Yes | Algorithm 1 SIGNSGD; Algorithm 2 SIGNUM; Algorithm 3 Distributed training by majority vote |
| Open Source Code | Yes | Code to reproduce experiments is to be found at https://github.com/jxbz/signSGD. |
| Open Datasets | Yes | throughout the paper we will make use of the CIFAR-10 (Krizhevsky, 2009) and Imagenet (Russakovsky et al., 2015) datasets. |
| Dataset Splits | Yes | Initial learning rate and weight decay were tuned on a separate validation set split off from the training set and all other hyperparameters were chosen to be those found favourable for SGD by the community. |
| Hardware Specification | No | The paper mentions "GPUs linked within a single machine" but does not specify any particular models or detailed hardware specifications used for the experiments. |
| Software Dependencies | No | The paper mentions open-source implementations and refers indirectly to software such as PyTorch and Caffe, but it does not specify version numbers for any key software components or libraries. |
| Experiment Setup | Yes | Top row: Resnet-20 architecture trained to epoch 50 on CIFAR-10 with a batch size of 128. Bottom row: Resnet-50 architecture trained to epoch 50 on Imagenet with a batch size of 256. Initial learning rate and weight decay were tuned on a separate validation set split off from the training set and all other hyperparameters were chosen to be those found favourable for SGD by the community. Note that for β = 0.9, we have C = 54 which is negligible. |
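
The Pseudocode row above cites Algorithm 1 (SIGNSGD), Algorithm 2 (SIGNUM) and Algorithm 3 (distributed training by majority vote). A minimal NumPy sketch of the first two update rules follows; the function names and the toy quadratic objective are illustrative and not taken from the paper's repository.

```python
# Minimal sketch of the signSGD (Algorithm 1) and Signum (Algorithm 2) updates.
# Function names and the toy objective are illustrative assumptions.
import numpy as np

def sign_sgd_step(x, grad, lr):
    """signSGD: step along the elementwise sign of the stochastic gradient."""
    return x - lr * np.sign(grad)

def signum_step(x, grad, momentum, lr, beta=0.9):
    """Signum: apply the sign to a momentum-smoothed gradient instead."""
    momentum = beta * momentum + (1.0 - beta) * grad
    return x - lr * np.sign(momentum), momentum

# Toy usage on f(x) = 0.5 * ||x||^2, whose gradient is x.
x = np.array([5.0, -3.0, 1.5])
m = np.zeros_like(x)
for step in range(100):
    grad = x  # a stochastic gradient would be used in practice
    x, m = signum_step(x, grad, m, lr=0.05)
```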
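
The abstract quoted in the Research Type row also describes distributed training by majority vote, with 1-bit compression of worker-server communication in both directions. The sketch below simulates that aggregation step in NumPy; the worker-noise model and hyperparameters are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of distributed training by majority vote (Algorithm 3):
# each worker sends only the sign of its stochastic gradient, the server
# aggregates by elementwise majority vote, and broadcasts a single sign
# vector back. Worker gradients are simulated noisy copies of a true gradient.
import numpy as np

def majority_vote_step(x, worker_grads, lr):
    # Each worker transmits sign(g_m): 1 bit per coordinate.
    worker_signs = [np.sign(g) for g in worker_grads]
    # The parameter server takes the sign of the sum of signs (majority vote).
    vote = np.sign(np.sum(worker_signs, axis=0))
    return x - lr * vote

# Toy usage: M = 7 workers observe noisy gradients of f(x) = 0.5 * ||x||^2.
rng = np.random.default_rng(0)
x = np.array([5.0, -3.0, 1.5])
for step in range(200):
    true_grad = x
    grads = [true_grad + rng.normal(scale=1.0, size=x.shape) for _ in range(7)]
    x = majority_vote_step(x, grads, lr=0.02)
```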
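
The Dataset Splits and Experiment Setup rows state that the initial learning rate and weight decay were tuned on a validation set split off from the training set. The helper below sketches one way such a protocol could look; the split fraction, candidate grid and the `train_and_eval` callback are hypothetical and not specified in the paper.

```python
# Hypothetical sketch of tuning the initial learning rate and weight decay
# on a validation split held out from the training set.
import numpy as np

def tune_on_validation(train_set, train_and_eval, val_fraction=0.1, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(train_set))
    n_val = int(val_fraction * len(train_set))
    val_idx, tr_idx = idx[:n_val], idx[n_val:]
    best = None
    for lr in (1e-4, 1e-3, 1e-2, 1e-1):        # assumed candidate grid
        for wd in (0.0, 1e-5, 1e-4):           # assumed candidate grid
            val_acc = train_and_eval(tr_idx, val_idx, lr=lr, weight_decay=wd)
            if best is None or val_acc > best[0]:
                best = (val_acc, lr, wd)
    return best  # (validation accuracy, chosen lr, chosen weight decay)
```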