Group-based Interleaved Pipeline Parallelism for Large-scale DNN Training
Authors: PengCheng Yang, Xiaoming Zhang, Wenpeng Zhang, Ming Yang, Hong Wei
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experiments, which train large BERT language models, show that compared to PipeDream-2BW, WPipe achieves a 1.4× acceleration and reduces the memory footprint by 36%, with nearly no loss in final model accuracy. |
| Researcher Affiliation | Industry | Pengcheng Yang, Xiaoming Zhang, Wenpeng Zhang, Ming Yang, Hong Wei (Ant Group, China); yangpc615@gmail.com, xiaominglan.zhang@antgroup.com, zhangwenpeng0@gmail.com, vincent.ym@antgroup.com, weihong9646@hotmail.com |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper mentions using PyTorch, transformers, and apex, which are open-source, but it does not provide a link to its own implementation code for WPipe. |
| Open Datasets | Yes | We finetuned BERT-BASE (Devlin et al., 2018) and BERT-LARGE (Devlin et al., 2018) for WPipe, PipeDream-2BW, and data parallelism on the QQP and MNLI tasks (Wang et al., 2018). ... We finetuned, respectively, ResNeXt50 (32x4d) (Xie et al., 2017) and ResNeXt101 (32x8d) (Xie et al., 2017) for WPipe, PipeDream-2BW, and data parallelism on the three datasets CIFAR-10 (Krizhevsky et al., 2009), CIFAR-100 (Krizhevsky et al., 2009), and Oxford 102 Flowers (Nilsback & Zisserman, 2008). |
| Dataset Splits | No | The paper mentions using several standard datasets for training and evaluation, but it does not explicitly describe the training, validation, and test splits (e.g., percentages, sample counts, or references to predefined splits with specific citations of how they were applied in this work). |
| Hardware Specification | Yes | WPipe is implemented with PyTorch-1.4 (Edward Z. Yang, 2021) and executes on two environments, i.e., a single machine with eight 16-GB V100 GPUs (Env-1) and a private cluster with 8×8 V100 GPUs (Env-2). ... there are 8 machines in our private cluster, and each machine has 8 GPUs with a memory size of 16 GB, an Intel(R) Xeon(R) Platinum 8163 CPU, 512 GB of RAM with a 25 Gbps Ethernet interface, and 300 GBps NVLink |
| Software Dependencies | Yes | WPipe is implemented with PyTorch-1.4 (Edward Z. Yang, 2021)... We used, respectively, the bert-base-uncased and bert-large-uncased pre-training weights from transformers-3.5.0 (Wolf et al., 2020). |
| Experiment Setup | Yes | We used the Adam optimizer, a learning rate of 8 × 10⁻⁵ (ν = 8 × 10⁻⁵) with 1000 warmup steps (ws = 1000) and a mini-batch size of 256 (b = 256) for BERT-BASE, and the same optimizer with ν = 4 × 10⁻⁵, ws = 2000, and b = 128 for BERT-LARGE. ... We used the pre-training weights from torchvision (Francisco Massa, 2021), the SGD optimizer, ν = 1 × 10⁻² with a 0.05 warmup ratio, and b = 256. (A hedged configuration sketch follows this table.) |
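
The fine-tuning hyperparameters quoted in the Experiment Setup row can be expressed as a short configuration sketch. The snippet below is illustrative only: it assumes a standard PyTorch + transformers fine-tuning loop (the authors' WPipe implementation is not public), and names such as `TOTAL_STEPS`, `training_step`, and the QQP label count are our own assumptions rather than details given in the paper.

```python
# Minimal sketch of the reported BERT-BASE fine-tuning hyperparameters:
# Adam, lr = 8e-5, 1000 warmup steps, mini-batch size 256 (BERT-LARGE used
# lr = 4e-5, 2000 warmup steps, mini-batch size 128). Pipeline-parallel
# execution (WPipe itself) is NOT reproduced here; this is a plain
# single-process loop for illustration.
import torch
from transformers import BertForSequenceClassification, get_linear_schedule_with_warmup

MODEL_NAME = "bert-base-uncased"   # pre-training weights from transformers-3.5.0, per the paper
LR, WARMUP_STEPS, BATCH_SIZE = 8e-5, 1000, 256
TOTAL_STEPS = 10_000               # assumed; the paper does not state the total step budget

model = BertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)  # QQP: 2 classes (assumed)
optimizer = torch.optim.Adam(model.parameters(), lr=LR)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps=TOTAL_STEPS
)

def training_step(batch):
    """One fine-tuning step; `batch` is a dict of tokenized QQP tensors including labels (assumed)."""
    model.train()
    optimizer.zero_grad()
    loss = model(**batch, return_dict=True).loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    return loss.item()
```

For the image experiments, the quoted settings (SGD, ν = 1 × 10⁻², 0.05 warmup ratio, b = 256, torchvision pre-training weights) would slot into the same structure with `torchvision.models.resnext50_32x4d(pretrained=True)` and `torch.optim.SGD` in place of the BERT model and Adam optimizer.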