On Distributed Adaptive Optimization with Gradient Compression
Authors: Xiaoyun Li, Belhal Karimi, Ping Li
ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerical experiments are conducted to justify the theoretical findings and demonstrate that the proposed method can achieve the same test accuracy as the full-gradient AMSGrad with substantial communication savings. With its simplicity and efficiency, COMP-AMS can serve as a useful distributed training framework for adaptive gradient methods. |
| Researcher Affiliation | Industry | Cognitive Computing Lab Baidu Research 10900 NE 8th St. Bellevue, WA 98004, USA {xiaoyunli,belhalkarimi,liping11}@baidu.com |
| Pseudocode | Yes | Algorithm 1 AMSGRAD (Reddi et al., 2018), Algorithm 2 Distributed COMP-AMS with error feedback (EF) |
| Open Source Code | No | The paper states: 'Our method has been implemented in the PaddlePaddle platform (www.paddlepaddle.org.cn).' This indicates implementation on a public platform but does not explicitly state that the specific code developed for this paper is open-source or provide a link to it. |
| Open Datasets | Yes | The MNIST dataset (LeCun et al., 1998) contains 60000 training samples of 28x28 gray-scale hand-written digits from 10 classes, and 10000 test samples. The CIFAR-10 dataset (Krizhevsky & Hinton, 2009) consists of 50000 32x32 RGB natural images from 10 classes for training and 10000 images for testing, which is trained with LeNet-5 (LeCun et al., 1998). The IMDB movie review dataset (Maas et al., 2011) is a popular binary classification dataset for sentiment analysis. |
| Dataset Splits | No | The paper mentions sizes for training and test sets but does not explicitly provide details about a validation split (e.g., specific percentages or sample counts for validation data). |
| Hardware Specification | Yes | Our experiments are performed on a GPU cluster with NVIDIA Tesla V100 cards. |
| Software Dependencies | No | The paper mentions 'PaddlePaddle platform' but does not specify a version number for it or any other software dependencies. |
| Experiment Setup | Yes | For MNIST and CIFAR-10, the local batch size on each worker is set to 32. For IMDB, the local batch size is 16. The hyper-parameters in COMP-AMS are set to the defaults β1 = 0.9, β2 = 0.999 and ϵ = 10^-8, which are also used for QAdam and 1Bit Adam. For 1Bit Adam, the number of warm-up training epochs is set to 1/20 of the total epochs. For all methods, we tune the initial learning rate over a fine grid (see Appendix A). |
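
The Pseudocode row above points to Algorithm 1 (AMSGrad) and Algorithm 2 (distributed COMP-AMS with error feedback). As a reading aid, the following is a minimal single-process NumPy sketch of one synchronous round of compressed-gradient aggregation with error feedback followed by an AMSGrad update on the server. The top-k compressor, variable names, and toy setup are illustrative assumptions and may not match the paper's exact formulation.

```python
import numpy as np

def topk_compress(g, k):
    """Keep the k largest-magnitude entries of g, zero the rest (illustrative compressor)."""
    out = np.zeros_like(g)
    idx = np.argpartition(np.abs(g), -k)[-k:]
    out[idx] = g[idx]
    return out

def comp_ams_step(theta, grads, errors, state, lr=1e-3,
                  beta1=0.9, beta2=0.999, eps=1e-8, k=10):
    """One synchronous step: each worker sends a compressed (gradient + error) message,
    the server averages the messages and applies an AMSGrad update."""
    compressed = []
    for i, g in enumerate(grads):          # each worker i holds its local stochastic gradient g
        corrected = g + errors[i]          # add accumulated compression error (error feedback)
        c = topk_compress(corrected, k)    # communicate only the compressed message
        errors[i] = corrected - c          # keep what was not transmitted for the next round
        compressed.append(c)
    g_bar = np.mean(compressed, axis=0)    # server-side average of compressed gradients

    # AMSGrad update on the aggregated gradient (Reddi et al., 2018)
    state["m"] = beta1 * state["m"] + (1 - beta1) * g_bar
    state["v"] = beta2 * state["v"] + (1 - beta2) * g_bar ** 2
    state["v_hat"] = np.maximum(state["v_hat"], state["v"])
    theta -= lr * state["m"] / (np.sqrt(state["v_hat"]) + eps)
    return theta

# Toy usage: 4 workers, a 100-dimensional parameter vector.
d, n_workers = 100, 4
theta = np.zeros(d)
errors = [np.zeros(d) for _ in range(n_workers)]
state = {"m": np.zeros(d), "v": np.zeros(d), "v_hat": np.zeros(d)}
grads = [np.random.randn(d) for _ in range(n_workers)]  # stand-in for local gradients
theta = comp_ams_step(theta, grads, errors, state)
```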
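
For the Experiment Setup row, the reported defaults (β1 = 0.9, β2 = 0.999, ϵ = 10^-8, local batch size 32 for MNIST/CIFAR-10 and 16 for IMDB) map directly onto a standard adaptive-optimizer configuration. The snippet below is a hedged PyTorch illustration of those settings only; the paper's actual implementation is in PaddlePaddle, and the model, data, and learning rate here are placeholders (the paper tunes the initial learning rate over a grid).

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and data; the paper trains LeNet-5 on MNIST/CIFAR-10 and a text model on IMDB.
model = nn.Linear(784, 10)
data = TensorDataset(torch.randn(256, 784), torch.randint(0, 10, (256,)))

# Reported hyper-parameters: beta1 = 0.9, beta2 = 0.999, eps = 1e-8; local batch size 32.
loader = DataLoader(data, batch_size=32, shuffle=True)
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-3,                # placeholder; tuned over a grid in the paper
                             betas=(0.9, 0.999),
                             eps=1e-8,
                             amsgrad=True)           # AMSGrad variant of Adam

criterion = nn.CrossEntropyLoss()
for x, y in loader:
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```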