SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient

Authors: Max Ryabinin, Tim Dettmers, Michael Diskin, Alexander Borzunov

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically validate our findings and compare SWARM parallelism with existing large-scale training approaches. Finally, we combine our insights with compression strategies to train a large Transformer language model with 1B shared parameters (13B before sharing) on preemptible T4 GPUs with less than 200 Mb/s network.
Researcher Affiliation | Collaboration | HSE University, Yandex, University of Washington
Pseudocode | Yes | Algorithm 1: Pseudocode of stochastic wiring; Algorithm 2: Adaptive rebalancing for SWARM parallelism (the stochastic-wiring idea is sketched after the table)
Open Source Code | Yes | The code for our experiments can be found at github.com/yandex-research/swarm.
Open Datasets | Yes | We train a Transformer language model with an architecture similar to prior work (Brown et al., 2020; Wang & Komatsuzaki, 2021; Black et al., 2021) and 1.01 billion parameters in total. ... First, to verify that model parallelism with asynchronous updates does not have significant convergence issues, we train the model on the Pile (Gao et al., 2020) dataset with 400 preemptible T4 instances, each hosting one accelerator. ... we train a Transformer language model (Vaswani et al., 2017) on the OpenWebText corpus (Gokaslan & Cohen, 2019).
Dataset Splits | No | The paper mentions training on various datasets and evaluating performance, but does not explicitly provide the specific percentages or counts for training, validation, and test splits.
Hardware Specification | Yes | Homogeneous V100 GPU nodes; each worker uses a V100-PCIe GPU with 16 CPU threads (E5-2660 v4) and 128 GB RAM; preemptible T4 instances; 7 instances with 8 A100 GPUs each
Software Dependencies | Yes | We use a popular PyTorch-based implementation of GPipe... We use hivemind==0.8.15 (Ryabinin & Gusev, 2020) with a single synchronous trainer based on the BERT training code from the Transformers library (Wolf et al., 2020). ... Each layer is a TransformerEncoderLayer from PyTorch 1.7.0 (Paszke et al., 2019) wrapped with activation checkpointing (see the checkpointing sketch after the table).
Experiment Setup | Yes | We use a batch size of 1 and sequences of 512 tokens; the microbatch size is 4 for xxlarge and 1 for GPT-3 and Ours, and the sequence length is 512. ... Our model consists of 3 stages, each containing a single Transformer decoder block with d_model = 4096 and 16 layers per pipeline stage. ... SWARM nodes run rebalancing every T = 300 seconds, and trainers measure peer performance using a moving average with α = 0.1. ... We use the LAMB optimizer (You et al., 2020) with a batch size of 4096 and a sequence length of 512. On top of that, we set η = 10⁻³ and β₂ = 0.95 (these values are consolidated in a config sketch below).
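
The Pseudocode row points to Algorithm 1 (stochastic wiring) and Algorithm 2 (adaptive rebalancing). Below is a minimal Python sketch of the stochastic-wiring idea only, assuming a throughput-proportional routing rule and an exponential moving average with α = 0.1 as reported in the Experiment Setup row; the names PeerThroughputTracker and pick_next_stage_peer are illustrative, not the authors' API.

```python
import random
from collections import defaultdict

class PeerThroughputTracker:
    """Illustrative sketch, not the authors' implementation: keep an
    exponential moving average (EMA) of each peer's measured throughput and
    route each microbatch to a next-stage peer with probability proportional
    to that estimate."""

    def __init__(self, alpha: float = 0.1):
        self.alpha = alpha               # EMA factor (alpha = 0.1 in the paper's setup)
        self.ema = defaultdict(float)    # peer id -> smoothed throughput (samples/s)

    def update(self, peer_id: str, measured: float) -> None:
        # Standard EMA update; the first measurement seeds the average directly.
        prev = self.ema[peer_id]
        self.ema[peer_id] = measured if prev == 0.0 else self.alpha * measured + (1 - self.alpha) * prev

    def pick_next_stage_peer(self, peers: list) -> str:
        # "Stochastic wiring": faster peers are chosen more often, so the
        # pipeline self-balances instead of waiting on its slowest member.
        weights = [self.ema[p] if self.ema[p] > 0.0 else 1.0 for p in peers]
        return random.choices(peers, weights=weights, k=1)[0]


# Example: peer-A is measured to be ~3x faster, so it receives ~3x more microbatches.
tracker = PeerThroughputTracker()
tracker.update("peer-A", 120.0)
tracker.update("peer-B", 40.0)
next_peer = tracker.pick_next_stage_peer(["peer-A", "peer-B"])
```

Algorithm 2 (adaptive rebalancing) complements this: roughly speaking, every T = 300 seconds peers from over-provisioned stages may switch to the current bottleneck stage, a period also recorded in the config sketch at the end of this section.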
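
The Software Dependencies row states that each pipeline layer is a TransformerEncoderLayer wrapped with activation checkpointing. A minimal sketch of that wrapping in PyTorch follows; the number of attention heads (nhead=16) is an assumption for illustration, since the quotes above only give d_model = 4096.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

class CheckpointedEncoderLayer(nn.Module):
    """Sketch of one pipeline-stage layer: a standard TransformerEncoderLayer
    wrapped with activation checkpointing, so intermediate activations are
    recomputed during the backward pass instead of being kept in memory."""

    def __init__(self, d_model: int = 4096, nhead: int = 16):  # nhead is assumed, not from the paper
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # On PyTorch >= 1.11 you may prefer checkpoint(self.layer, x, use_reentrant=False).
        return checkpoint(self.layer, x)
```

Checkpointing trades extra compute in the backward pass for a much smaller activation footprint, which helps a d_model = 4096 block fit on 16 GB accelerators such as the T4s and V100s listed in the Hardware Specification row.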
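
Finally, the Experiment Setup row scatters the reported hyperparameters across several quotes. The dictionary below merely consolidates those reported values in one place; the key names and grouping are assumptions, not a configuration format used by the authors.

```python
# Reported values from the Experiment Setup row, gathered into one (hypothetical) config.
swarm_experiment_config = {
    "model": {
        "num_pipeline_stages": 3,
        "layers_per_stage": 16,
        "d_model": 4096,
        "sequence_length": 512,
    },
    "optimizer": {
        "name": "LAMB",            # You et al., 2020
        "batch_size": 4096,
        "learning_rate": 1e-3,     # eta = 10^-3
        "beta2": 0.95,
    },
    "swarm": {
        "rebalance_period_s": 300,     # T = 300 seconds between rebalancing runs
        "throughput_ema_alpha": 0.1,   # moving-average factor used by trainers
    },
}
```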