PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization

Authors: Thijs Vogels, Sai Praneeth Karimireddy, Martin Jaggi

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose a new low-rank gradient compressor based on power iteration that can i) compress gradients rapidly, ii) efficiently aggregate the compressed gradients using all-reduce, and iii) achieve test performance on par with SGD. The proposed algorithm is the only method evaluated that achieves consistent wall-clock speedups when benchmarked against regular SGD using highly optimized off-the-shelf tools for distributed communication. We demonstrate reduced training times for convolutional networks as well as LSTMs on common datasets. This section demonstrates the practicality of POWERSGD for distributed optimization of deep neural networks. We show that the compression scheme of POWERSGD i) is fast and matches test performance of SGD, ii) scales well with increasing workers even with a sub-optimal communication backend, and iii) significantly reduces training time for larger models. Most of the analysis is performed on CIFAR10, in the setting described in the table on the right. We verify the generality of POWERSGD by an additional evaluation of an LSTM for language modeling on WIKITEXT-2. (A minimal sketch of the power-iteration compressor appears after this table.)
Researcher Affiliation | Academia | Thijs Vogels (EPFL, Lausanne, Switzerland, thijs.vogels@epfl.ch); Sai Praneeth Karimireddy (EPFL, Lausanne, Switzerland, sai.karimireddy@epfl.ch); Martin Jaggi (EPFL, Lausanne, Switzerland, martin.jaggi@epfl.ch)
Pseudocode | Yes | Algorithm 1 (Rank-r POWERSGD compression) and Algorithm 2 (Distributed Error-feedback SGD with Momentum). (A sketch of the error-feedback update also follows the table.)
Open Source Code | Yes | Our code is available at https://github.com/epfml/powersgd.
Open Datasets | Yes | We demonstrate reduced training times for convolutional networks as well as LSTMs on common datasets. Most of the analysis is performed on CIFAR10, in the setting described in the table on the right. We verify the generality of POWERSGD by an additional evaluation of an LSTM for language modeling on WIKITEXT-2.
Dataset Splits | No | The paper uses the CIFAR10 and WIKITEXT-2 datasets for its experiments but does not explicitly describe training/validation/test splits (e.g., percentages, sample counts, or how any validation set was carved out of the training data).
Hardware Specification | No | The paper states 'We use 16 GPUs on 8 machines, connected through a fast (10Gbit/s) network.' but does not specify the exact GPU model (e.g., NVIDIA V100, RTX 3090) or the CPU models used for the experiments.
Software Dependencies | No | The paper mentions 'NCCL (fastest in PYTORCH)' as the communication backend but does not specify version numbers for PyTorch, NCCL, or any other critical software dependencies.
Experiment Setup | Yes | Default experimental setting: Dataset: CIFAR10; Architecture: RESNET18; Number of workers: 16; Backend: NCCL (fastest in PYTORCH); Batch size: 128 × number of workers; Momentum: 0.9; Learning rate: tuned for 16 workers (0.1 × 16 for SGD), scaled linearly by the number of workers; LR decay: /10 at epochs 150 and 250; LR warmup: linear within 5 epochs, starting from the single-worker LR; # Epochs: 300; Weight decay: 10^-4 (0 for Batch Norm parameters); Repetitions: 3, with varying seeds; Error bars: min–max. (A sketch of this learning-rate schedule appears after the table.)
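For readers who want a concrete picture of Algorithm 1 (rank-r POWERSGD compression), here is a minimal PyTorch sketch of one power-iteration step for a single matrix-shaped gradient, with the distributed all-reduce calls only indicated in comments. The function names and the Gram–Schmidt helper are illustrative assumptions; the authors' reference implementation at https://github.com/epfml/powersgd is the authoritative version.

```python
import torch


def orthogonalize(matrix, eps=1e-8):
    """In-place Gram–Schmidt orthogonalization of the columns of `matrix`."""
    n, r = matrix.shape
    for i in range(r):
        col = matrix[:, i]
        col /= col.norm() + eps
        if i + 1 < r:
            rest = matrix[:, i + 1:]
            # Remove the component along `col` from all remaining columns.
            rest -= torch.outer(col, col @ rest)
    return matrix


def powersgd_compress(grad_matrix, Q):
    """One power-iteration step: grad_matrix (n x m) -> factors P (n x r), Q (m x r).

    Q is warm-started from the previous optimization step. In the multi-worker
    setting, P and the new Q would each be averaged across workers with an
    all-reduce before being used further.
    """
    P = grad_matrix @ Q            # (n x r); all_reduce(P) would follow here
    orthogonalize(P)
    Q_new = grad_matrix.t() @ P    # (m x r); all_reduce(Q_new) would follow here
    return P, Q_new


def powersgd_decompress(P, Q):
    """Rank-r approximation of the gradient matrix."""
    return P @ Q.t()
```

Reusing Q from the previous step (warm-starting the power iteration) is what lets a single iteration per optimization step suffice, which is central to the "compress gradients rapidly" claim quoted above.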
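Algorithm 2 (distributed error-feedback SGD with momentum) wraps the compressor in a feedback loop: each worker adds its locally accumulated compression error to the fresh gradient before compressing, stores what the rank-r approximation dropped, and applies momentum to the shared decompressed gradient. The sketch below shows one worker's update for a single matrix-shaped parameter; the training loop, parameter reshaping, and communication wiring are omitted, and the exact placement of the momentum update is our reading of the paper rather than the authors' code.

```python
def error_feedback_step(param, grad, error, Q, momentum_buf, lr, momentum=0.9):
    """One worker's error-feedback SGD step for a matrix-shaped parameter.

    grad:         local stochastic gradient (n x m)
    error:        locally accumulated compression error (n x m)
    Q:            warm-started right factor from the previous step (m x r)
    momentum_buf: momentum accumulator (n x m)
    """
    corrected = grad + error                   # error feedback
    P, Q = powersgd_compress(corrected, Q)     # compressed and (conceptually) all-reduced
    approx = powersgd_decompress(P, Q)         # low-rank gradient shared by all workers
    error = corrected - approx                 # remember what the compressor dropped
    momentum_buf.mul_(momentum).add_(approx)   # momentum on the decompressed gradient
    param.add_(momentum_buf, alpha=-lr)        # SGD update
    return error, Q, momentum_buf
```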
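The learning-rate row of the setup table combines linear scaling, warmup, and step decay. The helper below is a small sketch of that schedule as we read it from the table (base rate 0.1 scaled by the number of workers, linear warmup over 5 epochs from the single-worker rate, division by 10 at epochs 150 and 250); the function itself is not taken from the paper's code.

```python
def learning_rate(epoch, num_workers=16, base_lr=0.1,
                  warmup_epochs=5, decay_epochs=(150, 250)):
    """Learning rate at a given (possibly fractional) epoch.

    The target rate is base_lr * num_workers (linear scaling). Warmup ramps
    linearly from the single-worker rate base_lr to the target within the
    first `warmup_epochs` epochs; afterwards the rate is divided by 10 at
    each epoch in `decay_epochs`.
    """
    target = base_lr * num_workers
    if epoch < warmup_epochs:
        lr = base_lr + (target - base_lr) * (epoch / warmup_epochs)
    else:
        lr = target
    for decay_epoch in decay_epochs:
        if epoch >= decay_epoch:
            lr /= 10
    return lr
```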