signSGD with Majority Vote is Communication Efficient and Fault Tolerant
Authors: Jeremy Bernstein, Jiawei Zhao, Kamyar Azizzadenesheli, Anima Anandkumar
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Next, we embark on a large-scale empirical validation of our theory. We implement majority vote in the Pytorch deep learning framework, using CUDA kernels to bit pack sign tensors down to one bit. Our results provide experimental evidence for D1–D4. Comparing our framework to NCCL (the state of the art communications library), we were able to speed up Imagenet training by 25% when distributing over 7 to 15 AWS p3.2xlarge machines, albeit at a slight loss in generalisation. (A bit-packing sketch appears after the table.) |
| Researcher Affiliation | Academia | 1Caltech, 2Nanjing University of Aeronautics and Astronautics, 3UC Irvine bernstein@caltech.edu, jiaweizhao@nuaa.edu.cn, kazizzad@uci.edu, anima@caltech.edu |
| Pseudocode | Yes | Algorithm 1 SIGNUM with majority vote, the proposed algorithm for distributed optimisation. Good default settings for the tested machine learning problems are η = 0.0001 and β = 0.9, though tuning is recommended. All operations on vectors are element-wise. Setting β = 0 yields SIGNSGD. (A sketch of this update rule appears after the table.) |
| Open Source Code | No | The paper states that the distributed training system was built in Pytorch and used Gloo, but does not provide a link to, or an explicit statement about, open-sourcing its own implementation. |
| Open Datasets | Yes | Benchmarking against the state of the art collective communications library (NCCL), our framework with the parameter server housed entirely on one machine led to a 25% reduction in time for training resnet50 on Imagenet when using 15 AWS p3.2xlarge machines. |
| Dataset Splits | No | The paper discusses training and testing on datasets like ImageNet, CIFAR-10, and WikiText-103 but does not explicitly provide details about training/validation/test splits such as specific percentages, sample counts, or references to predefined split files. |
| Hardware Specification | Yes | Comparing our framework to NCCL (the state of the art communications library), we were able to speed up Imagenet training by 25% when distributing over 7 to 15 AWS p3.2xlarge machines. These machines each contain one Nvidia Tesla V100 GPU, and AWS lists the connection speed between machines as up to 10 Gbps. |
| Software Dependencies | No | The paper mentions software like 'Pytorch deep learning framework (Paszke et al., 2017)' and 'Gloo (2018) communication library' and 'NCCL (2018) communication library'. However, it only provides references to papers about these software packages (with years), not specific version numbers of the software itself (e.g., PyTorch 1.x.x). |
| Experiment Setup | Yes | Good default settings for the tested machine learning problems are η = 0.0001 and β = 0.9, though tuning is recommended. [...] We train a resnet50 model and distribute learning over 7 to 15 AWS p3.2xlarge machines. [...] resnet50 results use 7 p3.2xlarge machines for training Imagenet, each at batch size 128. alexnet uses 7 p3.2xlarge machines for Imagenet, each at batch size 64. QRNN uses 3 p3.16xlarge machines for training WikiText-103, each at batch size 240. |
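The Pseudocode row quotes Algorithm 1 (Signum with majority vote). The NumPy sketch below is a minimal single-process illustration of that update rule under stated assumptions: `grad_fn(x, m)` is a hypothetical helper returning worker `m`'s stochastic gradient, and the loop stands in for the paper's distributed Pytorch/CUDA implementation. All operations are element-wise, and `beta = 0` recovers signSGD, matching the quoted algorithm.

```python
import numpy as np

def signum_majority_vote(grad_fn, x0, lr=1e-4, beta=0.9, steps=1000, workers=7):
    """Minimal single-process sketch of Signum with majority vote.

    grad_fn(x, m) is an assumed helper returning worker m's stochastic
    gradient at x; in the paper the workers are separate machines.
    """
    x = np.asarray(x0, dtype=np.float64).copy()
    momenta = [np.zeros_like(x) for _ in range(workers)]
    for _ in range(steps):
        # Each worker updates its momentum element-wise and sends only
        # the 1-bit sign vector of that momentum to the server.
        worker_signs = []
        for m in range(workers):
            g = grad_fn(x, m)
            momenta[m] = (1.0 - beta) * g + beta * momenta[m]
            worker_signs.append(np.sign(momenta[m]))
        # The server takes an element-wise majority vote over the signs
        # and broadcasts the 1-bit result; every worker applies the step.
        vote = np.sign(np.sum(worker_signs, axis=0))
        x -= lr * vote
    return x

# Toy usage: each worker sees a noisy gradient of f(x) = ||x||^2 / 2.
rng = np.random.default_rng(0)
x_out = signum_majority_vote(lambda x, m: x + 0.1 * rng.standard_normal(x.shape),
                             x0=np.ones(10), lr=1e-3, steps=2000)
```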
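The Research Type row also notes that the authors' framework uses CUDA kernels to bit pack sign tensors down to one bit. The sketch below illustrates only the packing arithmetic in NumPy (the 32x saving relative to float32); the helper names and the zero-maps-to-+1 tie-break are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def pack_signs(v):
    """Pack a tensor's signs into one bit per element (illustrative)."""
    bits = v >= 0                        # boolean sign mask; 0 maps to +1
    return np.packbits(bits), v.size     # 8 signs per byte

def unpack_signs(packed, n):
    """Recover a {-1, +1} float vector from the packed bytes."""
    bits = np.unpackbits(packed)[:n]
    return bits.astype(np.float32) * 2.0 - 1.0

# A 1,024-element float32 gradient (4,096 bytes) packs into 128 bytes,
# the 32x reduction behind the communication-efficiency claim.
g = np.random.randn(1024).astype(np.float32)
packed, n = pack_signs(g)
assert packed.nbytes == 128
assert np.array_equal(unpack_signs(packed, n), np.where(g >= 0, 1.0, -1.0))
```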