BPipe: Memory-Balanced Pipeline Parallelism for Training Large Language Models
Authors: Taebum Kim, Hyoungjoo Kim, Gyeong-In Yu, Byung-Gon Chun
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation conducted on 48 A100 GPUs across six nodes interconnected with HDR InfiniBand shows that BPIPE accelerates the training of GPT-3 96B and GPT-3 134B models by 1.25x-2.17x compared to Megatron-LM, a state-of-the-art framework for training large language models. and (Section 4, Evaluation) To evaluate BPIPE, we ask the following questions: Does BPIPE facilitate faster training of large language models? (Section 4.2) Does BPIPE flatten the memory usage of each pipeline stage? (Section 4.3) Does BPIPE efficiently evict and load activations without performance degradation? (Section 4.4) |
| Researcher Affiliation | Collaboration | 1Friendli AI Inc., Seoul, Korea 2Department of Computer Science and Engineering, Seoul National University, Seoul, Korea. |
| Pseudocode | Yes | Algorithm 1 Transfer Scheduling Algorithm (a conceptual sketch of the memory-balancing problem this algorithm targets appears after this table) |
| Open Source Code | No | We have implemented BPIPE on Megatron-LM v3 (Korthikanti et al., 2022). The paper does not provide an explicit statement or link for the open-source code of BPIPE itself. |
| Open Datasets | No | We evaluate GPT-3 (Brown et al., 2020) throughout the experiments, one of the most representative LLMs. We use three different model configurations, as shown in Table 1, in which the largest model has 134 billion parameters in total. Sequence length and vocabulary size are 2,048 and 51,200 for all models, respectively, and we use mixed precision training (Micikevicius et al., 2017). The paper does not provide concrete access information (link, DOI, specific citation for the dataset used for training/evaluation) for a publicly available dataset. |
| Dataset Splits | No | The paper does not provide specific details about training, validation, or test dataset splits. |
| Hardware Specification | Yes | Our evaluations are conducted on a cluster of six HPE Apollo 6500 Gen10 Plus nodes, each of which is equipped with 8 NVIDIA 80 GiB A100 GPUs connected over NVLink and 4 Mellanox 200 Gbps HDR InfiniBand HCAs for communication. |
| Software Dependencies | Yes | We have implemented BPIPE on Megatron-LM v3 (Korthikanti et al., 2022). All experiments are executed on the NVIDIA PyTorch NGC 22.09 container. |
| Experiment Setup | Yes | Table 2. Training configurations of GPT-3 96B and GPT-3 134B models. tensor and pipeline represent the tensor and pipeline parallelism degrees, respectively. ... mb denotes the microbatch size, and each value corresponds to a different training configuration. and We evaluate them with Megatron-LM for each recomputation scope in the order of none, attention, and layer, where the none scope does not recompute any activation, the attention scope recomputes only the self-attention of the Transformer layer, which is known as selective recomputation (Korthikanti et al., 2022), and the layer scope recomputes the entire Transformer layer. (An illustrative mapping of these configuration variables to Megatron-LM flags appears after this table.) |
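
BPipe's motivation, and the target of its Transfer Scheduling Algorithm (Algorithm 1), is that the 1F1B pipeline schedule leaves earlier stages holding far more in-flight activations than later ones. The following is a minimal conceptual sketch of that imbalance and of a pairwise balancing rule in the spirit of BPipe; the pairing rule and transfer amounts are illustrative assumptions, not the paper's actual transfer schedule.

```python
# Conceptual sketch (not the paper's Algorithm 1): why 1F1B pipeline stages have
# imbalanced activation memory, and how pairing stages can even it out.
# The pairing rule and transfer amounts below are illustrative assumptions.

def in_flight_activations(num_stages: int) -> list[int]:
    """Peak number of microbatch activations each stage holds under 1F1B.

    Under 1F1B, stage s runs (num_stages - s) forward passes before its first
    backward pass, so earlier stages buffer more activations.
    """
    return [num_stages - s for s in range(num_stages)]


def balanced_plan(num_stages: int) -> list[tuple[int, int, float]]:
    """Pair stage i with stage (num_stages - 1 - i) and move activations from
    the heavier to the lighter stage so both hold roughly the same amount.

    Returns a list of (from_stage, to_stage, microbatches_moved) tuples.
    """
    load = in_flight_activations(num_stages)
    plan = []
    for i in range(num_stages // 2):
        j = num_stages - 1 - i
        move = (load[i] - load[j]) / 2  # equalize the pair
        plan.append((i, j, move))
    return plan


if __name__ == "__main__":
    P = 8
    print("1F1B peak activations per stage:", in_flight_activations(P))
    # [8, 7, 6, 5, 4, 3, 2, 1] -> stage 0 holds 8x more than stage 7
    print("Pairwise transfers to balance:", balanced_plan(P))
    # [(0, 7, 3.5), (1, 6, 2.5), (2, 5, 1.5), (3, 4, 0.5)]
```

For the Experiment Setup row, here is a hedged sketch of how the quoted configuration variables (tensor/pipeline parallelism degrees, microbatch size, sequence length, mixed precision, recomputation scope) could map onto Megatron-LM command-line flags. The flag names follow recent public Megatron-LM releases such as the one shipped in the NGC 22.09 container; the fork used by the paper may differ, and the numeric values below are placeholders rather than the settings from the paper's Table 1/2.

```python
# Hedged sketch: mapping the paper's configuration variables to Megatron-LM flags.
# Flag names follow recent public Megatron-LM releases and may differ in the
# exact version the authors modified; all numeric values are placeholders.

import subprocess

tensor_parallel = 8            # "tensor" degree from Table 2 (placeholder)
pipeline_parallel = 6          # "pipeline" degree from Table 2 (placeholder)
micro_batch_size = 1           # "mb" from Table 2 (placeholder)
recompute_scope = "attention"  # one of: none / attention / layer (paper's terms)

args = [
    "python", "pretrain_gpt.py",
    "--tensor-model-parallel-size", str(tensor_parallel),
    "--pipeline-model-parallel-size", str(pipeline_parallel),
    "--micro-batch-size", str(micro_batch_size),
    "--seq-length", "2048",  # sequence length stated in the paper
    "--fp16",                # mixed precision training (Micikevicius et al., 2017)
]

# The paper's "attention" scope corresponds to Megatron-LM's selective
# recomputation; "layer" recomputes the entire Transformer layer ("full");
# "none" disables recomputation, so no flag is added.
if recompute_scope == "attention":
    args += ["--recompute-granularity", "selective"]
elif recompute_scope == "layer":
    args += ["--recompute-granularity", "full", "--recompute-method", "uniform"]

print(" ".join(args))                # inspect the resulting command line
# subprocess.run(args, check=True)   # launching requires a full Megatron-LM setup
```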