IntSGD: Adaptive Floatless Compression of Stochastic Gradients

Authors: Konstantin Mishchenko, Bokun Wang, Dmitry Kovalev, Peter Richtárik

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically compare our IntSGD algorithm with several representative and strong baselines: SGD, heuristic IntSGD (Sapio et al., 2021), PowerSGD with error feedback (EF) (Vogels et al., 2019), NatSGD (Horváth et al., 2019), and QSGD (Alistarh et al., 2017). The experiments are performed on 16 NVIDIA Tesla V100 GPUs located on 8 compute nodes of a cluster (2 GPUs per node), following the PowerSGD paper. The compute nodes in the cluster use an InfiniBand HDR-100 Director Switch at 100 Gbps for their network connection. The cluster also supports the NVIDIA Collective Communications Library (NCCL). We consider two tasks: image classification by ResNet18 (He et al., 2016) on the CIFAR-10 dataset and language modeling by a 3-layer LSTM on the WikiText-2 dataset.
Researcher Affiliation | Academia | Konstantin Mishchenko (CNRS, École Normale Supérieure, Inria) konsta.mish@gmail.com; Bokun Wang (KAUST) bokunw.wang@gmail.com; Dmitry Kovalev (KAUST) dakovalev1@gmail.com; Peter Richtárik (KAUST) peter.richtarik@kaust.edu.sa
Pseudocode | Yes | Algorithm 1: IntSGD, with default settings β = 0.9 and ε = 10^-8 for the tested problems. Algorithm 2: IntSGD with adaptive block quantization. Algorithm 3: IntDIANA. (A hedged sketch of the integer rounding step appears after this table.)
Open Source Code | Yes | Our code is built on the codebase of PowerSGD. We also borrow their all-reduce-based implementations of SGD and PowerSGD. It is worth noting that QSGD and NatSGD do not support all-reduce; thus, we implement their collective communications by all-gather. The implementations of compression and decompression in QSGD and NatSGD are from the authors of NatSGD. We attach our code in the supplementary material.
Open Datasets | Yes | We consider two tasks: image classification by ResNet18 (He et al., 2016) on the CIFAR-10 dataset and language modeling by a 3-layer LSTM on the WikiText-2 dataset.
Dataset Splits | Yes | We tune the initial single-worker learning rate on full-precision SGD and then apply it to PowerSGD, QSGD, and NatSGD. The initial learning rate is tuned in the range {0.05, 0.1, 0.2, 0.5} and we choose 0.1. For the logistic regression experiment, 'The whole dataset is split according to its original indices into n folds, and each fold is assigned to a local worker, i.e., the data are heterogeneous.' (A hedged sketch of this index-based split appears after this table.)
Hardware Specification | Yes | The experiments are performed on 16 NVIDIA Tesla V100 GPUs located on 8 compute nodes of a cluster (2 GPUs per node), following the PowerSGD paper.
Software Dependencies | No | While the paper mentions using 'PyTorch implementations' (footnote 3), the 'PowerSGD codebase' (footnote 4), and the 'MPI4PY library' (citing Dalcín et al., 2005), it does not provide specific version numbers for any of these software dependencies.
Experiment Setup | Yes | For the task of training ResNet18 on the CIFAR-10 dataset, we utilize momentum β = 0.9 and weight decay with factor 10^-4 (except for the BatchNorm parameters) for all algorithms. All algorithms run for 300 epochs. The learning rate decays by a factor of 10 at epochs 150 and 250. The initial learning rate is tuned in the range {0.05, 0.1, 0.2, 0.5} and we choose 0.1. For the task of training a 3-layer LSTM, all algorithms run for 90 epochs. We set the size of word embeddings to 650, the sequence length to 30, the number of hidden units per layer to 650, and the dropout rate to 0.4. Besides, we tie the word embedding and softmax weights. We tune the initial learning rate in the range {0.6, 1.25, 2.5, 5} and we choose 1.25. (A hedged sketch of this training configuration appears after this table.)
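To make the Pseudocode and Open Source Code rows more concrete, here is a minimal sketch of the kind of integer rounding plus all-reduce aggregation that Algorithm 1 describes. It is based only on the defaults quoted above (β = 0.9, ε = 10^-8); the moving-average scale estimate, the function and variable names, and the use of torch.distributed are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.distributed as dist

def intsgd_allreduce(grad: torch.Tensor, scale: torch.Tensor,
                     beta: float = 0.9, eps: float = 1e-8) -> torch.Tensor:
    """Round a gradient to integers, all-reduce the integers, and rescale.

    `scale` is a 1-element tensor holding a running magnitude estimate.
    NOTE: the paper derives its own adaptive step size; this moving average
    is only a stand-in to keep the sketch self-contained.
    """
    # Update the running magnitude estimate with momentum beta.
    scale.mul_(beta).add_((1 - beta) * grad.norm() / (grad.numel() ** 0.5))
    alpha = scale.item() + eps

    # Stochastic rounding of grad / alpha to integers, so that the
    # compression is unbiased in expectation.
    scaled = grad / alpha
    low = scaled.floor()
    ints = (low + (torch.rand_like(scaled) < (scaled - low)).float()).to(torch.int64)

    # Integer codes can be summed directly with a plain all-reduce.
    dist.all_reduce(ints, op=dist.ReduceOp.SUM)

    # Decompress: rescale and average over the number of workers.
    return ints.float() * alpha / dist.get_world_size()
```

Summing integer codes directly is what lets IntSGD reuse NCCL's all-reduce, whereas, as the Open Source Code row notes, QSGD and NatSGD exchange per-worker codes with all-gather.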
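The Dataset Splits row quotes an index-based heterogeneous split for the logistic regression experiment. Below is a minimal sketch of that idea, assuming NumPy arrays; the function name, shapes, and worker count are illustrative and not taken from the released code.

```python
import numpy as np

def split_by_index(X: np.ndarray, y: np.ndarray, n_workers: int):
    """Split (X, y) into contiguous folds by original index, one per worker.

    The data are not shuffled before splitting, so each worker receives a
    different slice of the original ordering, i.e. the local datasets are
    heterogeneous, as described in the quoted sentence.
    """
    idx_folds = np.array_split(np.arange(len(X)), n_workers)
    return [(X[idx], y[idx]) for idx in idx_folds]

# Example: 10,000 samples with 20 features split across 16 workers.
X = np.random.randn(10_000, 20)
y = np.random.randint(0, 2, size=10_000)
shards = split_by_index(X, y, n_workers=16)
```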
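Finally, a minimal sketch of the CIFAR-10 configuration quoted in the Experiment Setup row: SGD with momentum 0.9, weight decay 10^-4 excluding the BatchNorm parameters, an initial learning rate of 0.1, and a tenfold decay at epochs 150 and 250 over 300 epochs. The parameter-group split and the MultiStepLR scheduler are assumptions about how one might reproduce the stated hyperparameters in PyTorch; they are not the authors' code.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

model = resnet18(num_classes=10)  # ResNet18 with a 10-class head for CIFAR-10

# Put BatchNorm parameters in a group without weight decay and everything
# else in a group with weight decay 1e-4, matching the quoted setup.
bn_params, other_params = [], []
for module in model.modules():
    target = bn_params if isinstance(module, nn.BatchNorm2d) else other_params
    target.extend(module.parameters(recurse=False))

optimizer = torch.optim.SGD(
    [
        {"params": other_params, "weight_decay": 1e-4},
        {"params": bn_params, "weight_decay": 0.0},
    ],
    lr=0.1,        # chosen from the tuned range {0.05, 0.1, 0.2, 0.5}
    momentum=0.9,  # momentum beta = 0.9
)

# Decay the learning rate by a factor of 10 at epochs 150 and 250.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[150, 250], gamma=0.1
)

for epoch in range(300):
    # ... one training epoch over CIFAR-10 goes here ...
    scheduler.step()
```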