Group-based Interleaved Pipeline Parallelism for Large-scale DNN Training
Authors: PengCheng Yang, Xiaoming Zhang, Wenpeng Zhang, Ming Yang, Hong Wei
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experiments, which train large BERT language models, show that compared to PipeDream-2BW, WPipe achieves a 1.4× acceleration and reduces the memory footprint by 36%, with nearly no loss in final model accuracy. |
| Researcher Affiliation | Industry | Pengcheng Yang, Xiaoming Zhang, Wenpeng Zhang, Ming Yang, Hong Wei (Ant Group, China); yangpc615@gmail.com, xiaominglan.zhang@antgroup.com, zhangwenpeng0@gmail.com, vincent.ym@antgroup.com, weihong9646@hotmail.com |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper mentions using PyTorch, transformers, and apex, which are open-source, but it does not provide a link to its own implementation code for WPipe. |
| Open Datasets | Yes | We finetuned BERT-BASE (Devlin et al., 2018) and BERT-LARGE (Devlin et al., 2018) for WPipe, PipeDream-2BW, and data parallelism on the QQP and MNLI tasks (Wang et al., 2018). ... We finetuned, respectively, ResNeXt50 (32x4d) (Xie et al., 2017) and ResNeXt101 (32x8d) (Xie et al., 2017) for WPipe, PipeDream-2BW, and data parallelism on the three datasets CIFAR-10 (Krizhevsky et al., 2009), CIFAR-100 (Krizhevsky et al., 2009), and Oxford 102 Flowers (Nilsback & Zisserman, 2008). |
| Dataset Splits | No | The paper mentions using several standard datasets for training and evaluation, but it does not explicitly describe the training, validation, and test splits (e.g., percentages, sample counts, or references to predefined splits with specific citations of how they were applied in this work). |
| Hardware Specification | Yes | WPipe is implemented with PyTorch-1.4 (Edward Z. Yang, 2021) and executes on two environments, i.e., a single machine with eight 16-GB V100 GPUs (Env-1) and a private cluster with 8×8 V100 GPUs (Env-2). ... there are 8 machines in our private cluster, and each machine has 8 GPUs with a memory size of 16 GB, an Intel(R) Xeon(R) Platinum 8163 CPU, 512 GB of RAM with a 25 Gbps Ethernet interface, and 300 GBps NVLink |
| Software Dependencies | Yes | WPipe is implemented with PyTorch-1.4 (Edward Z. Yang, 2021)... We used, respectively, the bert-base-uncased and bert-large-uncased pre-training weights from transformers-3.5.0 (Wolf et al., 2020). |
| Experiment Setup | Yes | We used the Adam optimizer, a learning rate of 8 × 10⁻⁵ (ν = 8 × 10⁻⁵) with 1000 warmup steps (ws = 1000) and a mini-batch size of 256 (b = 256) for BERT-BASE, and the same optimizer with ν = 4 × 10⁻⁵, ws = 2000, and b = 128 for BERT-LARGE. ... We used the pre-training weights from torchvision (Francisco Massa, 2021), the SGD optimizer, ν = 1 × 10⁻² with a 0.05 warmup ratio, and b = 256. (A hedged configuration sketch follows this table.) |
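
The fine-tuning hyperparameters quoted in the Experiment Setup row can be expressed as a short configuration sketch. The snippet below is illustrative only: it assumes a standard PyTorch + transformers fine-tuning loop (the authors' WPipe implementation is not public), and names such as `TOTAL_STEPS`, `training_step`, and the QQP label count are our own assumptions rather than details given in the paper.

```python
# Minimal sketch of the reported BERT-BASE fine-tuning hyperparameters:
# Adam, lr = 8e-5, 1000 warmup steps, mini-batch size 256 (BERT-LARGE used
# lr = 4e-5, 2000 warmup steps, mini-batch size 128). Pipeline-parallel
# execution (WPipe itself) is NOT reproduced here; this is a plain
# single-process loop for illustration.
import torch
from transformers import BertForSequenceClassification, get_linear_schedule_with_warmup

MODEL_NAME = "bert-base-uncased"   # pre-training weights from transformers-3.5.0, per the paper
LR, WARMUP_STEPS, BATCH_SIZE = 8e-5, 1000, 256
TOTAL_STEPS = 10_000               # assumed; the paper does not state the total step budget

model = BertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)  # QQP: 2 classes (assumed)
optimizer = torch.optim.Adam(model.parameters(), lr=LR)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps=TOTAL_STEPS
)

def training_step(batch):
    """One fine-tuning step; `batch` is a dict of tokenized QQP tensors including labels (assumed)."""
    model.train()
    optimizer.zero_grad()
    loss = model(**batch, return_dict=True).loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    return loss.item()
```

For the image experiments, the quoted settings (SGD, ν = 1 × 10⁻², 0.05 warmup ratio, b = 256, torchvision pre-training weights) would slot into the same structure with `torchvision.models.resnext50_32x4d(pretrained=True)` and `torch.optim.SGD` in place of the BERT model and Adam optimizer.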