TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models

Authors: Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, Dawn Song, Ion Stoica

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our evaluation shows that for the largest GPT-3 model with 175 billion parameters, TeraPipe achieves a 5.0x speedup improvement over the state-of-the-art synchronous model-parallel training methods on an AWS cluster consisting of 48 p3.16xlarge instances." and "We evaluate TeraPipe following the setup in Brown et al. (2020)."
Researcher Affiliation | Collaboration | Zhuohan Li (1), Siyuan Zhuang (1), Shiyuan Guo (1), Danyang Zhuo (2), Hao Zhang (1), Dawn Song (1), Ion Stoica (1); (1) UC Berkeley, (2) Duke University.
Pseudocode | Yes | "Algorithm 1 Selecting optimal slicing scheme given tmax." (A hedged Python sketch of this dynamic program appears after the table.)
Open Source Code | Yes | "The code for reproduction can be found at https://github.com/zhuohan123/terapipe"
Open Datasets | Yes | "We evaluate TeraPipe following the setup in Brown et al. (2020). Specifically, we test 3 settings in Brown et al. (2020): GPT3-1B, GPT3-13B, and GPT3-175B, which have 1 billion, 13 billion, and 175 billion parameters in total, respectively."
Dataset Splits | No | The paper mentions using GPT-3 models and an input sequence length of 2048 following Brown et al. (2020), but it does not specify details about train/validation/test dataset splits (e.g., percentages, sample counts, or explicit splitting methodology).
Hardware Specification | Yes | "We evaluate the configurations on an AWS cluster with p3.16xlarge nodes (each with 8 NVIDIA V100 GPUs)."
Software Dependencies | No | The paper does not explicitly provide version numbers for any ancillary software components like programming languages, frameworks (e.g., PyTorch, TensorFlow), or libraries.
Experiment Setup | Yes | "Table 1. Model settings and parallel training setups used in the evaluation. N: Number of Transformer layers. H: Hidden state size. #Params: Number of total parameters. L: Input sequence length. #GPUs: Total number of GPUs. B: Batch size. #Data: Number of data parallel shards. #Pipe: Number of pipeline stages. #Op: Number of GPUs used for operational partitioning by each Transformer layer." and "For all configurations, we set the input sequence length L = 2048 following Brown et al. (2020)." and "For each configuration, we select the maximal batch size that can fit the memory of the GPUs."
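
Note on the Pseudocode row: Algorithm 1 in the paper takes a cap tmax on the per-token-slice forward time and selects, via dynamic programming, a slicing of the input sequence that respects the cap while minimizing the total time; the choice of tmax itself is handled separately in the paper. The Python sketch below is only meant to make that recurrence concrete. It is a minimal illustration, not the authors' implementation (see the linked repository for that); the names optimal_slicing and slice_time, and the toy cost model in the usage example, are assumptions introduced here.

from typing import Callable, List, Optional, Tuple


def optimal_slicing(
    seq_len: int,
    slice_time: Callable[[int, int], float],
    t_max: float,
) -> Optional[Tuple[float, List[int]]]:
    """Choose token-slice boundaries whose per-slice time never exceeds
    t_max, minimizing the summed forward time over the whole sequence.

    slice_time(start, end) is an assumed latency model for the forward
    pass of tokens [start, end); a real system would measure these costs
    on the target hardware rather than rely on an analytic model.
    """
    INF = float("inf")
    # best[i]: minimal total time to process the first i tokens.
    best = [INF] * (seq_len + 1)
    best[0] = 0.0
    # prev[i]: start index of the last slice in the best split of prefix i.
    prev = [-1] * (seq_len + 1)

    for end in range(1, seq_len + 1):
        for start in range(end):
            t = slice_time(start, end)
            if t <= t_max and best[start] + t < best[end]:
                best[end] = best[start] + t
                prev[end] = start

    if best[seq_len] == INF:
        return None  # no slicing can satisfy the cap t_max

    # Walk back through prev to recover the slice end positions.
    cuts, i = [], seq_len
    while i > 0:
        cuts.append(i)
        i = prev[i]
    return best[seq_len], sorted(cuts)


# Toy usage with a made-up cost model: a slice's cost grows with its length
# and with its end position, since later tokens attend to a longer prefix
# under the causal attention mask.
if __name__ == "__main__":
    def cost(start: int, end: int) -> float:
        return 0.5 * (end - start) + 0.01 * (end - start) * end

    print(optimal_slicing(seq_len=16, slice_time=cost, t_max=6.0))

The O(L^2) double loop above is for clarity only; with L = 2048, a practical version would restrict the candidate split points and reuse measured slice costs.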