Moshpit SGD: Communication-Efficient Decentralized Training on Heterogeneous Unreliable Devices

Authors: Max Ryabinin, Eduard Gorbunov, Vsevolod Plokhotnyuk, Gennady Pekhimenko

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate the efficiency of our protocol for distributed optimization with strong theoretical guarantees. The experiments show 1.3x speedup for ResNet-50 training on ImageNet compared to competitive gossip-based strategies and 1.5x speedup when training ALBERT-large on preemptible compute nodes."
Researcher Affiliation | Collaboration | Max Ryabinin (Yandex, Russia; HSE University, Russia), Eduard Gorbunov (MIPT, Russia; HSE University, Russia; Yandex, Russia), Vsevolod Plokhotnyuk (Yandex, Russia; HSE University, Russia), Gennady Pekhimenko (University of Toronto, Canada; Vector Institute, Canada)
Pseudocode | Yes | Algorithm 1 Moshpit All-Reduce (for i-th peer) and Algorithm 2 Moshpit SGD (a minimal sketch of the grid-averaging idea appears below the table)
Open Source Code | Yes | "Implementation and code of experiments are at github.com/yandex-research/moshpit-sgd."
Open Datasets | Yes | "More specifically, we train ResNet-50 [65] on the ILSVRC [2] dataset... We opt for the ALBERT model [71] to make better use of communication-constrained devices. This model has fewer trainable parameters due to layer-wise weight sharing. Specifically, we train ALBERT-large (18M parameters) on the BookCorpus [72] dataset."
Dataset Splits | No | The paper uses well-known datasets and follows existing training protocols, which implies standard splits, but it does not explicitly state train/validation/test splits with percentages or counts.
Hardware Specification | Yes | "Homogeneous: 16 servers with a single Tesla V100-PCIe GPU, 6 CPU cores, and 64 GB RAM. Heterogeneous: a total of 81 GPUs (V100, 1080 Ti, and P40) across 64 servers and workstations... Homogeneous: a single cloud instance with 8 Tesla V100-PCIe GPUs and 56 vCPUs; Heterogeneous: a total of 66 preemptible GPUs, 32 of which are cloud T4, and the remaining 34 are various devices rented on a public marketplace."
Software Dependencies | No | The paper mentions SGD, the LAMB optimizer, and Distributed Hash Tables, but does not provide specific version numbers for software libraries or dependencies.
Experiment Setup | Yes | "Trainers use SGD with Nesterov momentum with a batch size of 256 and 32-bit precision regardless of the GPU type... For Moshpit SGD, we use a two-dimensional grid with 4 and 8 groups for homogeneous and heterogeneous setups respectively... We minimize the masked language modeling loss (MLM) along with the sentence order prediction loss (SOP) using the LAMB optimizer [17] with a global batch size of 4096 and sequence length 512." (an illustrative optimizer configuration also appears below the table)
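The Pseudocode row refers to Algorithm 1 (Moshpit All-Reduce). Below is a minimal single-process sketch of the grid-averaging idea it is built on, under the assumption of a full two-dimensional grid: peers average within groups along one grid dimension per round, so after one round per dimension every peer holds the global average. All names and the group-formation logic here are illustrative; the authors' implementation forms groups via a DHT and runs an in-group all-reduce across real machines.

```python
import numpy as np

# Illustrative sketch of grid averaging (the idea behind Moshpit All-Reduce),
# NOT the authors' implementation. Assumes a full M^d grid of peers.
M, d = 4, 2                                   # M groups per dimension, d dimensions
rng = np.random.default_rng(0)

# one parameter vector per peer, keyed by its grid coordinates
params = {idx: rng.normal(size=3) for idx in np.ndindex(*(M,) * d)}
global_mean = np.mean(list(params.values()), axis=0)

for dim in range(d):                          # one averaging round per grid dimension
    groups = {}
    for idx in params:
        # peers sharing all coordinates except `dim` form one group
        key = idx[:dim] + idx[dim + 1:]
        groups.setdefault(key, []).append(idx)
    for members in groups.values():           # in-group all-reduce, here a plain mean
        group_mean = np.mean([params[i] for i in members], axis=0)
        for i in members:
            params[i] = group_mean

# with a full grid, every peer now holds the exact global average
assert all(np.allclose(p, global_mean) for p in params.values())
```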
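The Experiment Setup row quotes the ImageNet trainer configuration. The snippet below is a hedged sketch of that configuration in PyTorch; only the quoted settings (SGD with Nesterov momentum, per-trainer batch size 256, 32-bit precision) are grounded in the row, while the learning rate and momentum coefficient are placeholders that the quoted text does not specify.

```python
import torch
import torchvision

# Hypothetical per-trainer setup matching the quoted "Experiment Setup" row;
# learning rate and momentum values are placeholders, not the paper's settings.
model = torchvision.models.resnet50(num_classes=1000)
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,            # placeholder: not stated in the quoted setup
    momentum=0.9,      # placeholder: not stated in the quoted setup
    nesterov=True,     # quoted: "SGD with Nesterov momentum"
)
batch_size = 256       # quoted per-trainer batch size, 32-bit precision assumed default
```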