Communication-efficient Distributed Learning for Large Batch Optimization

Authors: Rui Liu, Barzan Mozafari

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments to evaluate the effectiveness and efficiency of our method. Due to space limits, we focus on JOINTSPAR-LARS because optimizers with layerwise adaptive learning rates are more effective in the large-batch setting (You et al., 2017; 2019). More experiments on JOINTSPAR and other SGD-based compression methods can be found in the appendix.
Researcher Affiliation | Academia | Computer Science and Engineering, University of Michigan, Ann Arbor. Correspondence to: Rui Liu <ruixliu@umich.edu>, Barzan Mozafari <mozafari@umich.edu>.
Pseudocode | Yes | Algorithm 1: Distribution update for p_t^m; Algorithm 2: Our distributed learning method JOINTSPAR (for each worker m). (An illustrative sketch of the per-worker step appears after the table.)
Open Source Code | No | The paper does not contain any explicit statement or link indicating that the source code for the described methodology is open-source or publicly available.
Open Datasets | Yes | We use several benchmark datasets in our experiments: MNIST, Fashion-MNIST, SVHN, CIFAR10, CIFAR100, and ImageNet.
Dataset Splits | No | The paper mentions using well-known datasets and training for a certain number of epochs, but it does not explicitly provide the specific training, validation, or test dataset splits (e.g., percentages, counts, or references to predefined splits).
Hardware Specification | Yes | All experiments are run on a computer cluster with up to 16 nodes. Each node has 20 physical CPU cores with clock speeds up to 4 GHz and 4 NVIDIA P100 GPUs. Nodes are connected via a 100 Gb/s InfiniBand fabric.
Software Dependencies | No | We use PyTorch (Paszke et al., 2019) to implement models and learning methods, and use mpi4py (Dalcin et al., 2011) as the communication framework in the distributed setting. The paper names these packages and cites them, but does not specify their version numbers. (See the minimal mpi4py sketch after the table.)
Experiment Setup | Yes | We set the local batch size to 1024 for each machine, and use the same tricks (i.e., linear scaling and warmup) as suggested in (Goyal et al., 2017). Other experimental settings are kept the same as in the previous subsection. We train the model on each dataset for 90 epochs, with the first 5 epochs as the warmup stage, as suggested in (Goyal et al., 2017). For the learning rate schedule, we set the initial learning rate to 0.1 and shrink the learning rate by 0.1 at epochs 30, 50, and 70. (See the schedule sketch after the table.)
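
The Pseudocode row refers to the paper's Algorithm 1 (distribution update for p_t^m) and Algorithm 2 (the per-worker JOINTSPAR loop), neither of which is reproduced on this page. Below is a minimal, illustrative PyTorch sketch of a JOINTSPAR-style worker step, assuming only the general idea of sampling a subset of layers from a probability vector p, rescaling the kept gradients by 1/p for unbiasedness, and updating (and, in a real run, communicating) only the sampled layers. The function name jointspar_step, the Bernoulli sampling, the plain-SGD update, and the omission of the Algorithm 1 distribution update are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

def jointspar_step(model, loss_fn, batch, p, lr=0.1):
    """Illustrative JOINTSPAR-style step: sample active layers from p,
    rescale their gradients by 1/p, and update only those layers.

    NOTE: simplified sketch, not the paper's Algorithm 2; the distribution
    update for p (the paper's Algorithm 1) is intentionally omitted.
    """
    x, y = batch
    loss = loss_fn(model(x), y)
    loss.backward()

    params = list(model.parameters())
    # Layer d is kept ("active") with probability p[d].
    active = torch.bernoulli(p).bool()

    with torch.no_grad():
        for d, w in enumerate(params):
            if active[d] and w.grad is not None:
                w.grad.div_(p[d])   # 1/p rescaling keeps the estimate unbiased
                # In the distributed setting, only these sparse gradients
                # would be exchanged across workers (e.g., via mpi4py).
                w -= lr * w.grad
            w.grad = None           # inactive layers: neither sent nor updated
    return loss.item(), active

# Toy usage on a small model with a keep-probability of 0.8 per layer.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
p = torch.full((len(list(model.parameters())),), 0.8)
batch = (torch.randn(16, 10), torch.randint(0, 2, (16,)))
loss, active = jointspar_step(model, nn.CrossEntropyLoss(), batch, p)
```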
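
The Software Dependencies row names PyTorch and mpi4py without versions. The following is a minimal sketch, not the authors' code, of how mpi4py can serve as the communication layer for averaging PyTorch gradients across workers; it assumes reasonably recent PyTorch and mpi4py releases and contiguous CPU gradient buffers.

```python
from mpi4py import MPI
import torch

comm = MPI.COMM_WORLD
world_size = comm.Get_size()

def allreduce_mean_(tensor: torch.Tensor) -> torch.Tensor:
    """Average a contiguous CPU tensor across all MPI ranks, in place."""
    buf = tensor.numpy()  # zero-copy NumPy view of the CPU tensor
    comm.Allreduce(MPI.IN_PLACE, buf, op=MPI.SUM)
    tensor.div_(world_size)
    return tensor

# Typical use after a local backward pass (model not defined in this snippet):
# for param in model.parameters():
#     if param.grad is not None:
#         g = param.grad.detach().cpu()
#         allreduce_mean_(g)
#         param.grad.copy_(g)
```

A script like this would be launched with an MPI launcher, e.g. mpirun -np 4 python train.py.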
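
The Experiment Setup row describes a 90-epoch schedule with a 5-epoch warmup, an initial learning rate of 0.1, a 10x decay at epochs 30, 50, and 70, and the linear-scaling rule of Goyal et al. (2017). Below is a hedged sketch of that schedule; the worker count, the reference batch size of 256, and the choice to warm up from the base rate 0.1 to the linearly scaled peak are illustrative assumptions, not details stated in the excerpt.

```python
def lr_at_epoch(epoch, base_lr=0.1, warmup_epochs=5,
                milestones=(30, 50, 70), gamma=0.1,
                local_batch=1024, num_workers=16, reference_batch=256):
    """Learning rate for a 0-indexed epoch: linear warmup to the scaled
    peak over the first warmup_epochs epochs, then step decay by gamma
    at each milestone (num_workers and reference_batch are assumed values).
    """
    # Linear scaling rule: peak LR grows with the global batch size.
    peak_lr = base_lr * (local_batch * num_workers) / reference_batch
    if epoch < warmup_epochs:
        return base_lr + (peak_lr - base_lr) * (epoch + 1) / warmup_epochs
    lr = peak_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma  # shrink by 0.1 at epochs 30, 50, and 70
    return lr

schedule = [lr_at_epoch(e) for e in range(90)]  # full 90-epoch schedule
```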