TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models

Authors: Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, Dawn Song, Ion Stoica

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our evaluation shows that for the largest GPT-3 model with 175 billion parameters, TeraPipe achieves a 5.0x speedup improvement over the state-of-the-art synchronous model-parallel training methods on an AWS cluster consisting of 48 p3.16xlarge instances." and "We evaluate TeraPipe following the setup in Brown et al. (2020)."
Researcher Affiliation | Collaboration | Zhuohan Li (1), Siyuan Zhuang (1), Shiyuan Guo (1), Danyang Zhuo (2), Hao Zhang (1), Dawn Song (1), Ion Stoica (1); (1) UC Berkeley, (2) Duke University.
Pseudocode | Yes | "Algorithm 1 Selecting optimal slicing scheme given tmax." (A hedged Python sketch of this dynamic program appears after the table.)
Open Source Code | Yes | "The code for reproduction can be found at https://github.com/zhuohan123/terapipe"
Open Datasets | Yes | "We evaluate TeraPipe following the setup in Brown et al. (2020). Specifically, we test 3 settings in Brown et al. (2020): GPT3-1B, GPT3-13B, and GPT3-175B, which have 1 billion, 13 billion, and 175 billion parameters in total, respectively."
Dataset Splits | No | The paper mentions using GPT-3 models and an input sequence length of 2048 following Brown et al. (2020), but it does not specify details about train/validation/test dataset splits (e.g., percentages, sample counts, or explicit splitting methodology).
Hardware Specification | Yes | "We evaluate the configurations on an AWS cluster with p3.16xlarge nodes (each with 8 NVIDIA V100 GPUs)."
Software Dependencies | No | The paper does not explicitly provide version numbers for any ancillary software components like programming languages, frameworks (e.g., PyTorch, TensorFlow), or libraries.
Experiment Setup | Yes | "Table 1. Model settings and parallel training setups used in the evaluation. N: Number of Transformer layers. H: Hidden state size. #Params: Number of total parameters. L: Input sequence length. #GPUs: Total number of GPUs. B: Batch size. #Data: Number of data parallel shards. #Pipe: Number of pipeline stages. #Op: Number of GPUs used for operational partitioning by each Transformer layer." and "For all configurations, we set the input sequence length L = 2048 following Brown et al. (2020)." and "For each configuration, we select the maximal batch size that can fit the memory of the GPUs."
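
Note on the Pseudocode row: Algorithm 1 in the paper takes a cap tmax on the per-token-slice forward time and selects, via dynamic programming, a slicing of the input sequence that respects the cap while minimizing the total time; the choice of tmax itself is handled separately in the paper. The Python sketch below is only meant to make that recurrence concrete. It is a minimal illustration, not the authors' implementation (see the linked repository for that); the names optimal_slicing and slice_time, and the toy cost model in the usage example, are assumptions introduced here.

from typing import Callable, List, Optional, Tuple


def optimal_slicing(
    seq_len: int,
    slice_time: Callable[[int, int], float],
    t_max: float,
) -> Optional[Tuple[float, List[int]]]:
    """Choose token-slice boundaries whose per-slice time never exceeds
    t_max, minimizing the summed forward time over the whole sequence.

    slice_time(start, end) is an assumed latency model for the forward
    pass of tokens [start, end); a real system would measure these costs
    on the target hardware rather than rely on an analytic model.
    """
    INF = float("inf")
    # best[i]: minimal total time to process the first i tokens.
    best = [INF] * (seq_len + 1)
    best[0] = 0.0
    # prev[i]: start index of the last slice in the best split of prefix i.
    prev = [-1] * (seq_len + 1)

    for end in range(1, seq_len + 1):
        for start in range(end):
            t = slice_time(start, end)
            if t <= t_max and best[start] + t < best[end]:
                best[end] = best[start] + t
                prev[end] = start

    if best[seq_len] == INF:
        return None  # no slicing can satisfy the cap t_max

    # Walk back through prev to recover the slice end positions.
    cuts, i = [], seq_len
    while i > 0:
        cuts.append(i)
        i = prev[i]
    return best[seq_len], sorted(cuts)


# Toy usage with a made-up cost model: a slice's cost grows with its length
# and with its end position, since later tokens attend to a longer prefix
# under the causal attention mask.
if __name__ == "__main__":
    def cost(start: int, end: int) -> float:
        return 0.5 * (end - start) + 0.01 * (end - start) * end

    print(optimal_slicing(seq_len=16, slice_time=cost, t_max=6.0))

The O(L^2) double loop above is for clarity only; with L = 2048, a practical version would restrict the candidate split points and reuse measured slice costs.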