BPipe: Memory-Balanced Pipeline Parallelism for Training Large Language Models

Authors: Taebum Kim, Hyoungjoo Kim, Gyeong-In Yu, Byung-Gon Chun

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluation conducted on 48 A100 GPUs across six nodes interconnected with HDR InfiniBand shows that BPIPE accelerates the training of GPT-3 96B and GPT-3 134B models by 1.25x-2.17x compared to Megatron-LM, a state-of-the-art framework for training large language models. and 4. Evaluation: To evaluate BPIPE, we ask the following questions. Does BPIPE facilitate faster training of large language models? (§4.2) Does BPIPE flatten the memory usage of each pipeline stage? (§4.3) Does BPIPE efficiently evict and load activations without performance degradation? (§4.4)
Researcher Affiliation | Collaboration | ¹FriendliAI Inc., Seoul, Korea; ²Department of Computer Science and Engineering, Seoul National University, Seoul, Korea.
Pseudocode | Yes | Algorithm 1: Transfer Scheduling Algorithm
Open Source Code | No | We have implemented BPIPE on Megatron-LM v3 (Korthikanti et al., 2022). The paper does not provide an explicit statement or link for the open-source code of BPIPE itself.
Open Datasets | No | We evaluate GPT-3 (Brown et al., 2020) throughout the experiments, one of the most representative LLMs. We use three different model configurations, as shown in Table 1, in which the largest model has 134 billion parameters in total. Sequence length and vocabulary size are 2,048 and 51,200 for all models, respectively, and we use mixed precision training (Micikevicius et al., 2017). The paper does not provide concrete access information (link, DOI, or specific citation for the dataset used for training/evaluation) for a publicly available dataset.
Dataset Splits | No | The paper does not provide specific details about training, validation, or test dataset splits.
Hardware Specification | Yes | Our evaluations are conducted on a cluster of six HPE Apollo 6500 Gen10 Plus nodes, each of which is equipped with 8 NVIDIA 80 GiB A100 GPUs connected over NVLink and 4 Mellanox 200 Gbps HDR InfiniBand HCAs for communication.
Software Dependencies | Yes | We have implemented BPIPE on Megatron-LM v3 (Korthikanti et al., 2022). All experiments are executed on the NVIDIA PyTorch NGC 22.09 container.
Experiment Setup | Yes | Table 2. Training configurations of GPT-3 96B and GPT-3 134B models. tensor and pipeline represent the tensor and pipeline parallelism degrees, respectively. ... mb denotes the microbatch size, and each value corresponds to a different training configuration. and We evaluate them with Megatron-LM for each recomputation scope in the order of none, attention, and layer, where the none scope does not recompute any activation, the attention scope recomputes only the self-attention of the Transformer layer, which is known as selective recomputation (Korthikanti et al., 2022), and the layer scope recomputes the entire Transformer layer.
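
The evaluation question quoted in the Research Type row ("Does BPIPE flatten the memory usage of each pipeline stage?") refers to the activation-memory imbalance of the standard 1F1B pipeline schedule, in which earlier stages must keep activations for more in-flight microbatches than later stages. The sketch below is a minimal illustration of that imbalance, assuming the usual 1F1B bound of min(p - i, m) in-flight microbatches for stage i with p pipeline stages and m microbatches per iteration; the stage count, microbatch count, and per-microbatch activation size are hypothetical illustration values, not measurements from the paper.

```python
# Minimal sketch (not the paper's code) of the per-stage activation footprint
# under the standard 1F1B pipeline schedule -- the imbalance that BPipe aims
# to flatten by evicting and loading activations between stages.
# p, m, and act_gib_per_microbatch are hypothetical illustration values.

def inflight_microbatches_1f1b(p: int, m: int) -> list[int]:
    """Peak number of microbatches whose activations stage i must keep alive."""
    return [min(p - i, m) for i in range(p)]

def stage_activation_gib(p: int, m: int, act_gib_per_microbatch: float) -> list[float]:
    """Peak activation memory per pipeline stage, in GiB."""
    return [k * act_gib_per_microbatch
            for k in inflight_microbatches_1f1b(p, m)]

if __name__ == "__main__":
    p, m = 8, 32                      # pipeline stages, microbatches per iteration
    per_stage = stage_activation_gib(p, m, act_gib_per_microbatch=4.0)
    balanced = sum(per_stage) / p     # ideal level if activations were balanced
    for i, gib in enumerate(per_stage):
        print(f"stage {i}: {gib:5.1f} GiB of activations "
              f"(balanced target ~ {balanced:.1f} GiB)")
```

With these example values the first stage holds roughly eight times the activation memory of the last stage; this gap is what the paper's questions about flattening memory usage and evicting/loading activations refer to.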