Moshpit SGD: Communication-Efficient Decentralized Training on Heterogeneous Unreliable Devices
Authors: Max Ryabinin, Eduard Gorbunov, Vsevolod Plokhotnyuk, Gennady Pekhimenko
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the efficiency of our protocol for distributed optimization with strong theoretical guarantees. The experiments show 1.3x speedup for ResNet-50 training on ImageNet compared to competitive gossip-based strategies and 1.5x speedup when training ALBERT-large on preemptible compute nodes. |
| Researcher Affiliation | Collaboration | Max Ryabinin (Yandex, Russia; HSE University, Russia); Eduard Gorbunov (MIPT, Russia; HSE University, Russia; Yandex, Russia); Vsevolod Plokhotnyuk (Yandex, Russia; HSE University, Russia); Gennady Pekhimenko (University of Toronto, Canada; Vector Institute, Canada) |
| Pseudocode | Yes | Algorithm 1 Moshpit All-Reduce (for i-th peer) and Algorithm 2 Moshpit SGD (a grid-averaging sketch follows the table) |
| Open Source Code | Yes | Implementation and code of experiments are at github.com/yandex-research/moshpit-sgd. |
| Open Datasets | Yes | More specifically, we train ResNet-50 [65] on the ILSVRC [2] dataset... We opt for the ALBERT model [71] to make better use of communication-constrained devices. This model has fewer trainable parameters due to layer-wise weight sharing. Specifically, we train ALBERT-large (18M parameters) on the BookCorpus [72] dataset |
| Dataset Splits | No | The paper mentions using well-known datasets and following existing training protocols, which implies standard splits, but it does not explicitly state the train/validation/test splits with percentages or counts within the text of this paper. |
| Hardware Specification | Yes | Homogeneous: 16 servers, each with a single Tesla V100-PCIe GPU, 6 CPU cores, and 64GB RAM. Heterogeneous: a total of 81 GPUs (V100, 1080Ti, and P40) across 64 servers and workstations... Homogeneous: a single cloud instance with 8 Tesla V100-PCIe GPUs and 56 vCPUs; Heterogeneous: a total of 66 preemptible GPUs, 32 of which are cloud T4, and the remaining 34 are various devices rented on a public marketplace. |
| Software Dependencies | No | The paper mentions using SGD, LAMB optimizer, and Distributed Hash Tables but does not provide specific version numbers for software libraries or dependencies. |
| Experiment Setup | Yes | Trainers use SGD with Nesterov momentum with a batch size of 256 and 32-bit precision regardless of the GPU type... For Moshpit SGD, we use a two-dimensional grid with 4 and 8 groups for homogeneous and heterogeneous setups respectively... We minimize the masked language modeling loss (MLM) along with the sentence order prediction loss (SOP) using the LAMB optimizer [17] with a global batch size of 4096 and sequence length 512. (An optimizer-configuration sketch follows the table.) |
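
The Pseudocode row references Algorithm 1 (Moshpit All-Reduce). Below is a minimal simulation sketch of the grid-averaging idea behind it, assuming a full d-dimensional grid of peers holding scalar parameters. It is not the authors' implementation (their code is at github.com/yandex-research/moshpit-sgd); it only illustrates that averaging within groups along one grid dimension per round reproduces the exact global average after d rounds.

```python
# Hedged sketch of the grid averaging behind Moshpit All-Reduce, not the
# authors' implementation. Peers occupy a full d-dimensional grid; in round k,
# peers that share every grid coordinate except the k-th form a group and
# replace their value with the group mean. On a full grid, d such rounds
# leave every peer holding the exact global average.
import numpy as np

def simulate_moshpit_average(values: np.ndarray) -> np.ndarray:
    """`values` has one axis per grid dimension; each entry is one peer's parameter."""
    out = values.astype(float).copy()
    for axis in range(out.ndim):
        # Average within each group, i.e. along the current grid dimension.
        group_mean = out.mean(axis=axis, keepdims=True)
        out = np.broadcast_to(group_mean, out.shape).copy()
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    peers = rng.normal(size=(4, 8))               # 2-D grid of 4 x 8 = 32 peers
    averaged = simulate_moshpit_average(peers)
    assert np.allclose(averaged, peers.mean())    # every peer holds the global mean
    print("all peers converged to", averaged[0, 0])
```

The two-dimensional grid in this toy example mirrors the setup quoted in the Experiment Setup row, where Moshpit SGD uses a two-dimensional grid with 4 or 8 groups.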
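
The Experiment Setup row fixes only the optimizer family, batch size, and precision for the ResNet-50 runs. The PyTorch snippet below is a hedged configuration sketch: the learning rate, momentum value, and weight decay are illustrative assumptions not taken from the quoted text, and the ALBERT/LAMB settings are noted only in a comment because LAMB is not part of core PyTorch.

```python
# Hedged per-trainer configuration sketch based on the quoted setup.
# Only the optimizer family (SGD with Nesterov momentum), the batch size
# (256), and fp32 precision come from the paper's text; lr, momentum, and
# weight decay below are illustrative assumptions.
import torch
import torchvision

model = torchvision.models.resnet50()
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,             # assumed, not stated in the quoted text
    momentum=0.9,       # Nesterov momentum; the value is an assumption
    nesterov=True,
    weight_decay=1e-4,  # assumed
)
BATCH_SIZE = 256        # per the quoted setup; model kept in default fp32

# The ALBERT-large run instead uses the LAMB optimizer with a global batch
# size of 4096 and sequence length 512; LAMB requires a third-party
# implementation and is omitted from this sketch.
```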