Moshpit SGD: Communication-Efficient Decentralized Training on Heterogeneous Unreliable Devices

Authors: Max Ryabinin, Eduard Gorbunov, Vsevolod Plokhotnyuk, Gennady Pekhimenko

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate the efficiency of our protocol for distributed optimization with strong theoretical guarantees. The experiments show 1.3x speedup for ResNet-50 training on ImageNet compared to competitive gossip-based strategies and 1.5x speedup when training ALBERT-large on preemptible compute nodes."
Researcher Affiliation | Collaboration | Max Ryabinin (Yandex, Russia; HSE University, Russia), Eduard Gorbunov (MIPT, Russia; HSE University, Russia; Yandex, Russia), Vsevolod Plokhotnyuk (Yandex, Russia; HSE University, Russia), Gennady Pekhimenko (University of Toronto, Canada; Vector Institute, Canada)
Pseudocode | Yes | Algorithm 1 Moshpit All-Reduce (for i-th peer) and Algorithm 2 Moshpit SGD (a minimal sketch of the grid-averaging idea appears below the table)
Open Source Code | Yes | "Implementation and code of experiments are at github.com/yandex-research/moshpit-sgd."
Open Datasets | Yes | "More specifically, we train ResNet-50 [65] on the ILSVRC [2] dataset... We opt for the ALBERT model [71] to make better use of communication-constrained devices. This model has fewer trainable parameters due to layer-wise weight sharing. Specifically, we train ALBERT-large (18M parameters) on the BookCorpus [72] dataset."
Dataset Splits | No | The paper uses well-known datasets and follows existing training protocols, which implies standard splits, but it does not explicitly state train/validation/test splits with percentages or counts.
Hardware Specification | Yes | "Homogeneous: 16 servers with a single Tesla V100-PCIe GPU, 6 CPU cores, and 64 GB RAM. Heterogeneous: a total of 81 GPUs (V100, 1080 Ti, and P40) across 64 servers and workstations... Homogeneous: a single cloud instance with 8 Tesla V100-PCIe GPUs and 56 vCPUs; Heterogeneous: a total of 66 preemptible GPUs, 32 of which are cloud T4, and the remaining 34 are various devices rented on a public marketplace."
Software Dependencies | No | The paper mentions SGD, the LAMB optimizer, and Distributed Hash Tables, but does not provide specific version numbers for software libraries or dependencies.
Experiment Setup | Yes | "Trainers use SGD with Nesterov momentum with a batch size of 256 and 32-bit precision regardless of the GPU type... For Moshpit SGD, we use a two-dimensional grid with 4 and 8 groups for homogeneous and heterogeneous setups respectively... We minimize the masked language modeling loss (MLM) along with the sentence order prediction loss (SOP) using the LAMB optimizer [17] with a global batch size of 4096 and sequence length 512." (an illustrative optimizer configuration also appears below the table)
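The Pseudocode row refers to Algorithm 1 (Moshpit All-Reduce). Below is a minimal single-process sketch of the grid-averaging idea it is built on, under the assumption of a full two-dimensional grid: peers average within groups along one grid dimension per round, so after one round per dimension every peer holds the global average. All names and the group-formation logic here are illustrative; the authors' implementation forms groups via a DHT and runs an in-group all-reduce across real machines.

```python
import numpy as np

# Illustrative sketch of grid averaging (the idea behind Moshpit All-Reduce),
# NOT the authors' implementation. Assumes a full M^d grid of peers.
M, d = 4, 2                                   # M groups per dimension, d dimensions
rng = np.random.default_rng(0)

# one parameter vector per peer, keyed by its grid coordinates
params = {idx: rng.normal(size=3) for idx in np.ndindex(*(M,) * d)}
global_mean = np.mean(list(params.values()), axis=0)

for dim in range(d):                          # one averaging round per grid dimension
    groups = {}
    for idx in params:
        # peers sharing all coordinates except `dim` form one group
        key = idx[:dim] + idx[dim + 1:]
        groups.setdefault(key, []).append(idx)
    for members in groups.values():           # in-group all-reduce, here a plain mean
        group_mean = np.mean([params[i] for i in members], axis=0)
        for i in members:
            params[i] = group_mean

# with a full grid, every peer now holds the exact global average
assert all(np.allclose(p, global_mean) for p in params.values())
```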
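The Experiment Setup row quotes the ImageNet trainer configuration. The snippet below is a hedged sketch of that configuration in PyTorch; only the quoted settings (SGD with Nesterov momentum, per-trainer batch size 256, 32-bit precision) are grounded in the row, while the learning rate and momentum coefficient are placeholders that the quoted text does not specify.

```python
import torch
import torchvision

# Hypothetical per-trainer setup matching the quoted "Experiment Setup" row;
# learning rate and momentum values are placeholders, not the paper's settings.
model = torchvision.models.resnet50(num_classes=1000)
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,            # placeholder: not stated in the quoted setup
    momentum=0.9,      # placeholder: not stated in the quoted setup
    nesterov=True,     # quoted: "SGD with Nesterov momentum"
)
batch_size = 256       # quoted per-trainer batch size, 32-bit precision assumed default
```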