PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models

Authors: Chaoyang He, Shen Li, Mahdi Soltanolkotabi, Salman Avestimehr

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate PipeTransformer using Vision Transformer (ViT) on ImageNet and BERT on SQuAD and GLUE datasets. Our results show that compared to the state-of-the-art baseline, PipeTransformer attains up to 2.83-fold speedup without losing accuracy.
Researcher Affiliation | Collaboration | 1 University of Southern California, 2 Facebook AI Research.
Pseudocode | Yes | Algorithm 1 presents the pseudo-code.
Open Source Code | Yes | Finally, we have modularized our training system with flexible APIs and made the source code publicly available at https://DistML.ai.
Open Datasets | Yes | Experiments employ two representative Transformers in CV and NLP: Vision Transformer (ViT) and BERT. ViT was run on an image classification task, initialized with pre-trained weights on ImageNet21K and fine-tuned on ImageNet and CIFAR-100. BERT was run on two tasks, text classification on the SST-2 dataset from the General Language Understanding Evaluation (GLUE) benchmark, and question answering on the SQuAD v1.1 Dataset (Stanford Question Answering) which is a collection of 100k crowdsourced question/answer pairs.
Dataset Splits | No | The paper mentions using standard academic datasets (ImageNet, CIFAR-100, SQuAD, GLUE), which typically have predefined validation splits, but it does not explicitly state the splits used (percentages, counts, or methodology) for validation in the main text.
Hardware Specification | Yes | Experiments were conducted on 2 identical machines connected by InfiniBand CX353A (5GB/s), where each machine is equipped with 8 NVIDIA Quadro RTX 5000 (16GB GPU memory).
Software Dependencies | Yes | We used PyTorch Pipe as a building block... Hence, we used the developer version 1.8.0.dev20201219. The BERT model definition, configuration, and related tokenizer are from Hugging Face 3.5.0. (A minimal Pipe usage sketch follows the table.)
Experiment Setup | Yes | Hyper-parameters. Experiments use ViT-B/16 (12 transformer layers, 16x16 input patch size) for ImageNet and CIFAR-100, BERT-large-uncased (24 layers) for SQuAD 1.1, and BERT-base-uncased (12 layers) for SST-2. With PipeTransformer, ViT and BERT training can set the per-pipeline batch size to around 400 and 64 respectively. Other hyperparameters (e.g., epoch, learning rate) for all experiments are presented in the Appendix. (A model instantiation sketch follows the table.)
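
The Software Dependencies row refers to PyTorch Pipe (torch.distributed.pipeline.sync.Pipe in the 1.8 line) as the pipelining building block. Below is a minimal sketch of how that API is typically driven on a single machine with two GPUs; the toy two-stage model, device placement, and chunk count are illustrative assumptions, not the paper's actual PipeTransformer configuration.

```python
# Minimal sketch (assumption: PyTorch ~1.8 with 2 visible GPUs); not the paper's code.
import os
import torch
import torch.nn as nn
import torch.distributed.rpc as rpc
from torch.distributed.pipeline.sync import Pipe

# Pipe in PyTorch 1.8 requires the RPC framework to be initialized,
# even for a single-process run.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
rpc.init_rpc("worker", rank=0, world_size=1)

# Toy two-stage model: each top-level child of the Sequential becomes one
# pipeline partition and must live on a single device.
stage0 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).to("cuda:0")
stage1 = nn.Sequential(nn.Linear(1024, 10)).to("cuda:1")
model = Pipe(nn.Sequential(stage0, stage1), chunks=4)  # 4 micro-batches per mini-batch

x = torch.randn(64, 1024, device="cuda:0")
# In PyTorch 1.8 the forward pass returns an RRef; older dev builds returned
# the tensor directly. local_value() fetches the output tensor here.
y = model(x).local_value()
print(y.shape)  # torch.Size([64, 10]), resident on cuda:1

rpc.shutdown()
```

PipeTransformer's elastic freeze/shrink logic is layered on top of this primitive; the sketch only shows the underlying Pipe call pattern.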
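
The Experiment Setup row names BERT-large-uncased (24 layers) for SQuAD 1.1 and BERT-base-uncased (12 layers) for SST-2, with model definitions and tokenizers from Hugging Face transformers 3.5.0. The sketch below shows one plausible way to instantiate those models; the checkpoint identifiers are the standard Hugging Face hub names, and the batch-size constant simply restates the per-pipeline value from the table, while optimizer, learning rate, and epochs are deferred to the paper's appendix and omitted.

```python
# Sketch only (assumption: transformers ~3.5.0); standard hub checkpoints,
# not necessarily the paper's exact training setup.
from transformers import (
    BertTokenizer,
    BertForQuestionAnswering,
    BertForSequenceClassification,
)

# BERT-large-uncased (24 layers) for SQuAD v1.1 question answering.
qa_tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
qa_model = BertForQuestionAnswering.from_pretrained("bert-large-uncased")

# BERT-base-uncased (12 layers) for SST-2 binary sentiment classification.
cls_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
cls_model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Per-pipeline batch size reported in the table: ~64 for BERT (vs. ~400 for ViT).
PER_PIPELINE_BATCH_SIZE = 64
```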