BPipe: Memory-Balanced Pipeline Parallelism for Training Large Language Models

Authors: Taebum Kim, Hyoungjoo Kim, Gyeong-In Yu, Byung-Gon Chun

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluation conducted on 48 A100 GPUs across six nodes interconnected with HDR InfiniBand shows that BPIPE accelerates the training of GPT-3 96B and GPT-3 134B models by 1.25x-2.17x compared to Megatron-LM, a state-of-the-art framework for training large language models. and 4. Evaluation: To evaluate BPIPE, we ask the following questions. Does BPIPE facilitate faster training of large language models? (§4.2) Does BPIPE flatten the memory usage of each pipeline stage? (§4.3) Does BPIPE efficiently evict and load activations without performance degradation? (§4.4)
Researcher Affiliation | Collaboration | ¹FriendliAI Inc., Seoul, Korea; ²Department of Computer Science and Engineering, Seoul National University, Seoul, Korea.
Pseudocode | Yes | Algorithm 1: Transfer Scheduling Algorithm
Open Source Code | No | We have implemented BPIPE on Megatron-LM v3 (Korthikanti et al., 2022). The paper does not provide an explicit statement or link for the open-source code of BPIPE itself.
Open Datasets | No | We evaluate GPT-3 (Brown et al., 2020) throughout the experiments, one of the most representative LLMs. We use three different model configurations, as shown in Table 1, in which the largest model has 134 billion parameters in total. Sequence length and vocabulary size are 2,048 and 51,200 for all models, respectively, and we use mixed precision training (Micikevicius et al., 2017). The paper does not provide concrete access information (link, DOI, or specific citation for the dataset used for training/evaluation) for a publicly available dataset.
Dataset Splits | No | The paper does not provide specific details about training, validation, or test dataset splits.
Hardware Specification | Yes | Our evaluations are conducted on a cluster of six HPE Apollo 6500 Gen10 Plus nodes, each of which is equipped with 8 NVIDIA 80 GiB A100 GPUs connected over NVLink and 4 Mellanox 200 Gbps HDR InfiniBand HCAs for communication.
Software Dependencies | Yes | We have implemented BPIPE on Megatron-LM v3 (Korthikanti et al., 2022). All experiments are executed on the NVIDIA PyTorch NGC 22.09 container.
Experiment Setup | Yes | Table 2. Training configurations of GPT-3 96B and GPT-3 134B models. tensor and pipeline represent the tensor and pipeline parallelism degrees, respectively. ... mb denotes the microbatch size, and each value corresponds to a different training configuration. and We evaluate them with Megatron-LM for each recomputation scope in the order of none, attention, and layer, where the none scope does not recompute any activation, the attention scope recomputes only the self-attention of the Transformer layer, which is known as selective recomputation (Korthikanti et al., 2022), and the layer scope recomputes the entire Transformer layer.
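
The evaluation question quoted in the Research Type row ("Does BPIPE flatten the memory usage of each pipeline stage?") refers to the activation-memory imbalance of the standard 1F1B pipeline schedule, in which earlier stages must keep activations for more in-flight microbatches than later stages. The sketch below is a minimal illustration of that imbalance, assuming the usual 1F1B bound of min(p - i, m) in-flight microbatches for stage i with p pipeline stages and m microbatches per iteration; the stage count, microbatch count, and per-microbatch activation size are hypothetical illustration values, not measurements from the paper.

```python
# Minimal sketch (not the paper's code) of the per-stage activation footprint
# under the standard 1F1B pipeline schedule -- the imbalance that BPipe aims
# to flatten by evicting and loading activations between stages.
# p, m, and act_gib_per_microbatch are hypothetical illustration values.

def inflight_microbatches_1f1b(p: int, m: int) -> list[int]:
    """Peak number of microbatches whose activations stage i must keep alive."""
    return [min(p - i, m) for i in range(p)]

def stage_activation_gib(p: int, m: int, act_gib_per_microbatch: float) -> list[float]:
    """Peak activation memory per pipeline stage, in GiB."""
    return [k * act_gib_per_microbatch
            for k in inflight_microbatches_1f1b(p, m)]

if __name__ == "__main__":
    p, m = 8, 32                      # pipeline stages, microbatches per iteration
    per_stage = stage_activation_gib(p, m, act_gib_per_microbatch=4.0)
    balanced = sum(per_stage) / p     # ideal level if activations were balanced
    for i, gib in enumerate(per_stage):
        print(f"stage {i}: {gib:5.1f} GiB of activations "
              f"(balanced target ~ {balanced:.1f} GiB)")
```

With these example values the first stage holds roughly eight times the activation memory of the last stage; this gap is what the paper's questions about flattening memory usage and evicting/loading activations refer to.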