signSGD with Majority Vote is Communication Efficient and Fault Tolerant

Authors: Jeremy Bernstein, Jiawei Zhao, Kamyar Azizzadenesheli, Anima Anandkumar

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Next, we embark on a large-scale empirical validation of our theory. We implement majority vote in the Pytorch deep learning framework, using CUDA kernels to bit-pack sign tensors down to one bit. Our results provide experimental evidence for D1–D4. Comparing our framework to NCCL (the state-of-the-art communications library), we were able to speed up Imagenet training by 25% when distributing over 7 to 15 AWS p3.2xlarge machines, albeit at a slight loss in generalisation. (A hedged sketch of the bit-packing idea follows this table.)
Researcher Affiliation | Academia | Caltech; Nanjing University of Aeronautics and Astronautics; UC Irvine. bernstein@caltech.edu, jiaweizhao@nuaa.edu.cn, kazizzad@uci.edu, anima@caltech.edu
Pseudocode | Yes | Algorithm 1: SIGNUM with majority vote, the proposed algorithm for distributed optimisation. Good default settings for the tested machine learning problems are η = 0.0001 and β = 0.9, though tuning is recommended. All operations on vectors are element-wise. Setting β = 0 yields SIGNSGD. (A hedged sketch of this update follows this table.)
Open Source Code | No | The paper states that the distributed training system was built in Pytorch and used Gloo, but it provides no link to, or explicit statement about, its own implementation being open-sourced.
Open Datasets | Yes | Benchmarking against the state-of-the-art collective communications library (NCCL), our framework with the parameter server housed entirely on one machine led to a 25% reduction in time for training resnet50 on Imagenet when using 15 AWS p3.2xlarge machines.
Dataset Splits | No | The paper discusses training and testing on datasets like ImageNet, CIFAR-10, and WikiText-103 but does not explicitly provide details about training/validation/test splits such as specific percentages, sample counts, or references to predefined split files.
Hardware Specification | Yes | Comparing our framework to NCCL (the state-of-the-art communications library), we were able to speed up Imagenet training by 25% when distributing over 7 to 15 AWS p3.2xlarge machines. These machines each contain one Nvidia Tesla V100 GPU, and AWS lists the connection speed between machines as up to 10 Gbps.
Software Dependencies | No | The paper mentions software such as the 'Pytorch deep learning framework (Paszke et al., 2017)', the 'Gloo (2018)' communication library, and the 'NCCL (2018)' communication library. However, it only cites papers or years for these packages and does not give specific software version numbers (e.g., PyTorch 1.x.x).
Experiment Setup | Yes | Good default settings for the tested machine learning problems are η = 0.0001 and β = 0.9, though tuning is recommended. [...] We train a resnet50 model and distribute learning over 7 to 15 AWS p3.2xlarge machines. [...] resnet50 results use 7 p3.2xlarge machines for training Imagenet, each at batch size 128. alexnet uses 7 p3.2xlarge machines for Imagenet, each at batch size 64. QRNN uses 3 p3.16xlarge machines for training WikiText-103, each at batch size 240.
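
For reference, the following is a minimal pure-PyTorch sketch of the bit-packing idea mentioned in the Research Type row: gradients are reduced to their element-wise signs and packed eight to a byte before being communicated. The paper's own implementation uses custom CUDA kernels; the helper names `pack_signs` and `unpack_signs` and the sign convention for zeros are assumptions made here for illustration.

```python
import torch

# Bit weights used to pack/unpack eight sign bits into one byte.
_BYTE_WEIGHTS = torch.tensor([1, 2, 4, 8, 16, 32, 64, 128], dtype=torch.uint8)

def pack_signs(tensor: torch.Tensor) -> torch.Tensor:
    """Pack the element-wise signs of `tensor` into a uint8 tensor, 8 signs per byte.

    Zeros are treated as +1 here; the paper's CUDA kernels may use a different convention.
    """
    bits = (tensor.flatten() >= 0).to(torch.uint8)   # 1 for non-negative, 0 for negative
    pad = (-bits.numel()) % 8                        # pad so the length is a multiple of 8
    if pad:
        bits = torch.cat([bits, bits.new_zeros(pad)])
    return (bits.view(-1, 8) * _BYTE_WEIGHTS).sum(dim=1, dtype=torch.uint8)

def unpack_signs(packed: torch.Tensor, numel: int) -> torch.Tensor:
    """Recover a float tensor of +1/-1 signs of length `numel` from the packed bytes."""
    bits = (packed.unsqueeze(1) & _BYTE_WEIGHTS).ne(0).flatten()[:numel]
    return bits.float() * 2 - 1                      # True -> +1.0, False -> -1.0

# A float32 gradient of n elements travels as roughly n/8 bytes, i.e. 32x smaller.
g = torch.randn(1000)
signs = unpack_signs(pack_signs(g), g.numel())       # matches torch.sign(g) for nonzero entries
```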
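
Likewise, here is a hedged sketch of the update summarised in the Pseudocode row, simulating several workers in a single process. It assumes the SIGNUM momentum form m ← βm + (1 − β)g, with each worker transmitting only sign(m) and the parameter server broadcasting the sign of the summed signs (the majority vote); it is not the authors' released implementation, and in the real system the transmitted signs would be bit-packed as above.

```python
import torch

def majority_vote_step(params, momenta, worker_grads, lr=1e-4, beta=0.9):
    """One SIGNUM-with-majority-vote step, simulating M workers in one process.

    params:       list of parameter tensors (shared by all workers)
    momenta:      momenta[m][j] is worker m's momentum for parameter j
    worker_grads: worker_grads[m][j] is worker m's stochastic gradient for parameter j
    Setting beta = 0 recovers SIGNSGD with majority vote.
    """
    num_workers = len(worker_grads)
    for j, p in enumerate(params):
        worker_signs = []
        for m in range(num_workers):
            # Each worker updates its momentum and sends only the element-wise sign.
            momenta[m][j].mul_(beta).add_(worker_grads[m][j], alpha=1 - beta)
            worker_signs.append(torch.sign(momenta[m][j]))
        # Parameter server: majority vote = sign of the sum of the worker signs,
        # broadcast back to every worker and applied with the shared learning rate.
        vote = torch.sign(torch.stack(worker_signs).sum(dim=0))
        p.sub_(lr * vote)

# Toy usage with 3 simulated workers and a single 5-dimensional parameter.
params = [torch.zeros(5)]
momenta = [[torch.zeros(5)] for _ in range(3)]
grads = [[torch.randn(5)] for _ in range(3)]
majority_vote_step(params, momenta, grads)
```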