On Distributed Adaptive Optimization with Gradient Compression

Authors: Xiaoyun Li, Belhal Karimi, Ping Li

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Numerical experiments are conducted to justify the theoretical findings, and demonstrate that the proposed method can achieve the same test accuracy as the full-gradient AMSGrad with substantial communication savings. With its simplicity and efficiency, COMP-AMS can serve as a useful distributed training framework for adaptive gradient methods.
Researcher Affiliation | Industry | Cognitive Computing Lab, Baidu Research, 10900 NE 8th St., Bellevue, WA 98004, USA. {xiaoyunli,belhalkarimi,liping11}@baidu.com
Pseudocode | Yes | Algorithm 1: AMSGrad (Reddi et al., 2018); Algorithm 2: Distributed COMP-AMS with error feedback (EF). A hedged sketch of the worker-side EF compression step appears below the table.
Open Source Code | No | The paper states: 'Our method has been implemented in the PaddlePaddle platform (www.paddlepaddle.org.cn).' This indicates implementation on a public platform but does not explicitly state that the specific code developed for this paper is open-source or provide a link to it.
Open Datasets | Yes | MNIST (LeCun et al., 1998) contains 60000 training samples of 28x28 gray-scale hand-written digits from 10 classes, and 10000 test samples. The CIFAR-10 dataset (Krizhevsky & Hinton, 2009) consists of 50000 32x32 RGB natural images from 10 classes for training and 10000 images for testing, which is trained with LeNet-5 (LeCun et al., 1998). The IMDB movie review dataset (Maas et al., 2011) is a popular binary classification dataset for sentiment analysis.
Dataset Splits | No | The paper gives sizes for the training and test sets but does not explicitly provide details about a validation split (e.g., specific percentages or sample counts for validation data).
Hardware Specification | Yes | Our experiments are performed on a GPU cluster with NVIDIA Tesla V100 cards.
Software Dependencies | No | The paper mentions the 'PaddlePaddle platform' but does not specify a version number for it or any other software dependencies.
Experiment Setup | Yes | For MNIST and CIFAR-10, the local batch size on each worker is set to 32. For IMDB, the local batch size is 16. The hyper-parameters in COMP-AMS are set to the defaults β1 = 0.9, β2 = 0.999 and ϵ = 10⁻⁸, which are also used for QAdam and 1Bit Adam. For 1Bit Adam, the number of warm-up training epochs is set to 1/20 of the total epochs. For all methods, the initial learning rate is tuned over a fine grid (see Appendix A). A sketch of an optimizer update using these defaults follows the table.
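
The Pseudocode row above references Algorithm 2, distributed COMP-AMS with error feedback (EF). The following is a minimal NumPy sketch of what one worker-side step with error feedback typically looks like, assuming a Top-k compressor; the function names and the choice of compressor are illustrative and not taken from the paper's code.

```python
import numpy as np

def top_k_compress(grad, k):
    """Keep the k largest-magnitude entries of the gradient.
    Top-k is one common compressor choice; others can be substituted."""
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    compressed = np.zeros_like(flat)
    compressed[idx] = flat[idx]
    return compressed.reshape(grad.shape)

def worker_step(grad, error, k):
    """One worker-side compression step with error feedback:
    compress the error-corrected gradient and keep the residual locally."""
    corrected = grad + error               # add accumulated compression error
    compressed = top_k_compress(corrected, k)
    new_error = corrected - compressed     # residual carried to the next round
    return compressed, new_error

# Example (hypothetical shapes): a worker keeps 10% of a 100-entry gradient.
g = np.random.randn(100)
e = np.zeros_like(g)
c, e = worker_step(g, e, k=10)
```

Only the compressed tensor is communicated to the server, while the residual stays on the worker, which is where the communication savings come from.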
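The Experiment Setup row lists the adaptive-optimizer defaults β1 = 0.9, β2 = 0.999 and ϵ = 10⁻⁸. Below is a minimal sketch of a server-side AMSGrad update (Algorithm 1) using those values; variable names and the omission of bias correction follow the common AMSGrad formulation rather than the paper's exact pseudocode.

```python
import numpy as np

def amsgrad_step(param, grad, m, v, v_hat, lr,
                 beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad update on the aggregated gradient.
    Defaults mirror the hyper-parameters reported in the Experiment Setup row."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    v_hat = np.maximum(v_hat, v)              # AMSGrad's running max of v
    param = param - lr * m / (np.sqrt(v_hat) + eps)
    return param, m, v, v_hat
```

In COMP-AMS this update would be applied to the average of the workers' compressed gradients, and the learning rate lr is the quantity tuned over a fine grid in the paper.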