TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models
Authors: Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, Dawn Song, Ion Stoica
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation shows that for the largest GPT-3 model with 175 billion parameters, TeraPipe achieves a 5.0x speedup over the state-of-the-art synchronous model-parallel training methods on an AWS cluster consisting of 48 p3.16xlarge instances. We evaluate TeraPipe following the setup in Brown et al. (2020). |
| Researcher Affiliation | Collaboration | Zhuohan Li¹, Siyuan Zhuang¹, Shiyuan Guo¹, Danyang Zhuo², Hao Zhang¹, Dawn Song¹, Ion Stoica¹ (¹UC Berkeley, ²Duke University). |
| Pseudocode | Yes | Algorithm 1: Selecting optimal slicing scheme given t_max. (A sketch of this dynamic program follows the table.) |
| Open Source Code | Yes | The code for reproduction can be found at https://github.com/zhuohan123/terapipe |
| Open Datasets | Yes | We evaluate TeraPipe following the setup in Brown et al. (2020). Specifically, we test 3 settings in Brown et al. (2020): GPT3-1B, GPT3-13B, and GPT3-175B, which have 1 billion, 13 billion, and 175 billion parameters in total, respectively. |
| Dataset Splits | No | The paper mentions using GPT-3 models and an input sequence length of 2048 following Brown et al. (2020), but it does not specify details about train/validation/test dataset splits (e.g., percentages, sample counts, or explicit splitting methodology). |
| Hardware Specification | Yes | We evaluate the configurations on an AWS cluster with p3.16xlarge nodes (each with 8 NVIDIA V100 GPUs). |
| Software Dependencies | No | The paper does not explicitly provide version numbers for any ancillary software components like programming languages, frameworks (e.g., PyTorch, TensorFlow), or libraries. |
| Experiment Setup | Yes | Table 1 gives the model settings and parallel training setups used in the evaluation: N (number of Transformer layers), H (hidden state size), #Params (total number of parameters), L (input sequence length), #GPUs (total number of GPUs), B (batch size), #Data (number of data-parallel shards), #Pipe (number of pipeline stages), and #Op (number of GPUs used for operation partitioning within each Transformer layer). For all configurations, the input sequence length is L = 2048 following Brown et al. (2020), and the maximal batch size that fits in GPU memory is selected. (A configuration sanity-check sketch follows the table.) |
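
The slicing scheme named in the Pseudocode row is selected by a dynamic program over token-slice sizes. The sketch below is a reconstruction for illustration, not the authors' implementation: `slice_time(prefix_len, slice_len)` is a hypothetical per-stage cost estimate (forward time of a slice of `slice_len` tokens whose attention context already holds `prefix_len` earlier tokens), and the latency objective, total slice time plus `(num_stages - 1) * t_max` for the pipeline bubbles, is an assumption about how a slicing is scored.

```python
def optimal_slicing(L, slice_time, t_max, num_stages):
    """Pick token-slice sizes s_1..s_M that partition an L-token input.

    Each slice must satisfy slice_time(prefix, s) <= t_max; the objective is
    sum_i slice_time(prefix_i, s_i) + (num_stages - 1) * t_max, an assumed
    estimate of the end-to-end pipeline latency.
    """
    INF = float("inf")
    # best[i] = (cost of slicing the first i tokens, size of the last slice)
    best = [(INF, None) for _ in range(L + 1)]
    best[0] = (0.0, None)
    for i in range(1, L + 1):
        for s in range(1, i + 1):            # last slice covers tokens [i - s, i)
            t = slice_time(i - s, s)
            if t > t_max:                    # respect the per-slice bound
                continue
            cand = best[i - s][0] + t
            if cand < best[i][0]:
                best[i] = (cand, s)
    if best[L][0] == INF:
        return None                          # no feasible slicing for this t_max
    # Recover the slice sizes by backtracking.
    slices, i = [], L
    while i > 0:
        s = best[i][1]
        slices.append(s)
        i -= s
    slices.reverse()
    return best[L][0] + (num_stages - 1) * t_max, slices


# Example with a toy cost model (purely illustrative numbers): a linear term
# for the feed-forward part and a context-dependent term for attention.
toy_cost = lambda prefix, s: 0.01 * s + 0.0001 * (prefix + s) * s
print(optimal_slicing(L=64, slice_time=toy_cost, t_max=1.0, num_stages=4))
```

As the caption "given t_max" suggests, the complete method presumably repeats this search over candidate t_max values and keeps the slicing with the lowest overall latency estimate.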
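
The Experiment Setup row lists separate degrees for data parallelism (#Data), pipelining (#Pipe), and per-layer operation partitioning (#Op). Assuming these degrees jointly factor the total GPU count and that the global batch B is split across data-parallel shards, a minimal sanity-check sketch looks like the following; the degrees in the example call are placeholders, not the values reported in Table 1.

```python
def check_parallel_setup(num_nodes, gpus_per_node, data, pipe, op, batch_size):
    """Sanity-check a 3D parallel layout (assumed relation, not taken from the
    paper's code): #GPUs = #Data x #Pipe x #Op, with the global batch B split
    evenly across the #Data data-parallel shards."""
    num_gpus = num_nodes * gpus_per_node
    assert data * pipe * op == num_gpus, "parallel degrees must use every GPU exactly once"
    assert batch_size % data == 0, "batch size must divide evenly across data-parallel shards"
    return num_gpus, batch_size // data

# 48 p3.16xlarge nodes with 8 V100s each give 384 GPUs; the degrees below are
# hypothetical placeholders, not the configurations reported in Table 1.
print(check_parallel_setup(num_nodes=48, gpus_per_node=8, data=2, pipe=24, op=8, batch_size=32))
```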