PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models

Authors: Chaoyang He, Shen Li, Mahdi Soltanolkotabi, Salman Avestimehr

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate PipeTransformer using Vision Transformer (ViT) on ImageNet and BERT on SQuAD and GLUE datasets. Our results show that compared to the state-of-the-art baseline, PipeTransformer attains up to 2.83-fold speedup without losing accuracy.
Researcher Affiliation | Collaboration | 1 University of Southern California, 2 Facebook AI Research.
Pseudocode | Yes | Algorithm 1 presents the pseudo-code.
Open Source Code | Yes | Finally, we have modularized our training system with flexible APIs and made the source code publicly available at https://DistML.ai.
Open Datasets | Yes | Experiments employ two representative Transformers in CV and NLP: Vision Transformer (ViT) and BERT. ViT was run on an image classification task, initialized with pre-trained weights on ImageNet21K and fine-tuned on ImageNet and CIFAR-100. BERT was run on two tasks, text classification on the SST-2 dataset from the General Language Understanding Evaluation (GLUE) benchmark, and question answering on the SQuAD v1.1 Dataset (Stanford Question Answering) which is a collection of 100k crowdsourced question/answer pairs.
Dataset Splits | No | The paper mentions using standard academic datasets (ImageNet, CIFAR-100, SQuAD, GLUE), which typically have predefined validation splits, but it does not explicitly state the splits used (percentages, counts, or methodology) for validation in the main text.
Hardware Specification | Yes | Experiments were conducted on 2 identical machines connected by InfiniBand CX353A (5GB/s), where each machine is equipped with 8 NVIDIA Quadro RTX 5000 (16GB GPU memory).
Software Dependencies | Yes | We used PyTorch Pipe as a building block... Hence, we used the developer version 1.8.0.dev20201219. The BERT model definition, configuration, and related tokenizer are from Hugging Face 3.5.0. (A minimal Pipe usage sketch follows the table.)
Experiment Setup | Yes | Hyper-parameters. Experiments use ViT-B/16 (12 transformer layers, 16x16 input patch size) for ImageNet and CIFAR-100, BERT-large-uncased (24 layers) for SQuAD 1.1, and BERT-base-uncased (12 layers) for SST-2. With PipeTransformer, ViT and BERT training can set the per-pipeline batch size to around 400 and 64 respectively. Other hyperparameters (e.g., epoch, learning rate) for all experiments are presented in the Appendix. (A model instantiation sketch follows the table.)
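
The Software Dependencies row refers to PyTorch Pipe (torch.distributed.pipeline.sync.Pipe in the 1.8 line) as the pipelining building block. Below is a minimal sketch of how that API is typically driven on a single machine with two GPUs; the toy two-stage model, device placement, and chunk count are illustrative assumptions, not the paper's actual PipeTransformer configuration.

```python
# Minimal sketch (assumption: PyTorch ~1.8 with 2 visible GPUs); not the paper's code.
import os
import torch
import torch.nn as nn
import torch.distributed.rpc as rpc
from torch.distributed.pipeline.sync import Pipe

# Pipe in PyTorch 1.8 requires the RPC framework to be initialized,
# even for a single-process run.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
rpc.init_rpc("worker", rank=0, world_size=1)

# Toy two-stage model: each top-level child of the Sequential becomes one
# pipeline partition and must live on a single device.
stage0 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).to("cuda:0")
stage1 = nn.Sequential(nn.Linear(1024, 10)).to("cuda:1")
model = Pipe(nn.Sequential(stage0, stage1), chunks=4)  # 4 micro-batches per mini-batch

x = torch.randn(64, 1024, device="cuda:0")
# In PyTorch 1.8 the forward pass returns an RRef; older dev builds returned
# the tensor directly. local_value() fetches the output tensor here.
y = model(x).local_value()
print(y.shape)  # torch.Size([64, 10]), resident on cuda:1

rpc.shutdown()
```

PipeTransformer's elastic freeze/shrink logic is layered on top of this primitive; the sketch only shows the underlying Pipe call pattern.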
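
The Experiment Setup row names BERT-large-uncased (24 layers) for SQuAD 1.1 and BERT-base-uncased (12 layers) for SST-2, with model definitions and tokenizers from Hugging Face transformers 3.5.0. The sketch below shows one plausible way to instantiate those models; the checkpoint identifiers are the standard Hugging Face hub names, and the batch-size constant simply restates the per-pipeline value from the table, while optimizer, learning rate, and epochs are deferred to the paper's appendix and omitted.

```python
# Sketch only (assumption: transformers ~3.5.0); standard hub checkpoints,
# not necessarily the paper's exact training setup.
from transformers import (
    BertTokenizer,
    BertForQuestionAnswering,
    BertForSequenceClassification,
)

# BERT-large-uncased (24 layers) for SQuAD v1.1 question answering.
qa_tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
qa_model = BertForQuestionAnswering.from_pretrained("bert-large-uncased")

# BERT-base-uncased (12 layers) for SST-2 binary sentiment classification.
cls_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
cls_model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Per-pipeline batch size reported in the table: ~64 for BERT (vs. ~400 for ViT).
PER_PIPELINE_BATCH_SIZE = 64
```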