PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models
Authors: Chaoyang He, Shen Li, Mahdi Soltanolkotabi, Salman Avestimehr
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate PipeTransformer using Vision Transformer (ViT) on ImageNet and BERT on SQuAD and GLUE datasets. Our results show that compared to the state-of-the-art baseline, PipeTransformer attains up to 2.83-fold speedup without losing accuracy. |
| Researcher Affiliation | Collaboration | University of Southern California; Facebook AI Research. |
| Pseudocode | Yes | Algorithm 1 presents the pseudo-code. |
| Open Source Code | Yes | Finally, we have modularized our training system with flexible APIs and made the source code publicly available at https://DistML.ai. |
| Open Datasets | Yes | Experiments employ two representative Transformers in CV and NLP: Vision Transformer (ViT) and BERT. ViT was run on an image classification task, initialized with pre-trained weights on ImageNet21K and fine-tuned on ImageNet and CIFAR-100. BERT was run on two tasks: text classification on the SST-2 dataset from the General Language Understanding Evaluation (GLUE) benchmark, and question answering on the SQuAD v1.1 Dataset (Stanford Question Answering), which is a collection of 100k crowdsourced question/answer pairs. (A hedged data/model loading sketch follows the table.) |
| Dataset Splits | No | The paper uses standard academic datasets (ImageNet, CIFAR-100, SQuAD, GLUE), which come with predefined validation splits, but the main text does not explicitly state split percentages, example counts, or splitting methodology. |
| Hardware Specification | Yes | Experiments were conducted on 2 identical machines connected by InfiniBand CX353A (5GB/s), where each machine is equipped with 8 NVIDIA Quadro RTX 5000 (16GB GPU memory). |
| Software Dependencies | Yes | We used PyTorch Pipe as a building block... Hence, we used the developer version 1.8.0.dev20201219. The BERT model definition, configuration, and related tokenizer are from Hugging Face 3.5.0. (A hedged PyTorch Pipe usage sketch follows the table.) |
| Experiment Setup | Yes | Hyper-parameters. Experiments use ViT-B/16 (12 transformer layers, 16×16 input patch size) for ImageNet and CIFAR-100, BERT-large-uncased (24 layers) for SQuAD 1.1, and BERT-base-uncased (12 layers) for SST-2. With PipeTransformer, ViT and BERT training can set the per-pipeline batch size to around 400 and 64, respectively. Other hyperparameters (e.g., epoch, learning rate) for all experiments are presented in the Appendix. |
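
The datasets and pre-trained models quoted in the Open Datasets row are all publicly downloadable. The snippet below is a minimal sketch, assuming the standard torchvision and Hugging Face transformers (3.5.x) download APIs; the data root, image resizing, and checkpoint names are illustrative choices, not the authors' training code.

```python
import torchvision
from torchvision import transforms
from transformers import BertTokenizer, BertForQuestionAnswering

# CIFAR-100 for the ViT fine-tuning task; resizing to 224x224 is an
# assumption matching the usual ViT-B/16 input resolution.
vit_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
cifar100_train = torchvision.datasets.CIFAR100(
    root="./data", train=True, download=True, transform=vit_transform
)

# BERT-large-uncased for SQuAD v1.1 question answering, pulled from the
# Hugging Face model hub (the paper pins transformers 3.5.0).
tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
bert_qa = BertForQuestionAnswering.from_pretrained("bert-large-uncased")
```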
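
The "PyTorch Pipe" building block named in the Software Dependencies row corresponds to torch.distributed.pipeline.sync.Pipe, available in the 1.8.0 nightly the paper pins. Below is a minimal sketch of wrapping a sequential model into a two-GPU pipeline; the toy layers, device placement, and chunk count are assumptions for illustration, not PipeTransformer's own partitioning logic.

```python
import os
import torch
import torch.nn as nn
from torch.distributed import rpc
from torch.distributed.pipeline.sync import Pipe

# Pipe requires the RPC framework to be initialized even for a
# single-process pipeline; the address/port values here are placeholders.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
rpc.init_rpc("worker", rank=0, world_size=1)

# Toy two-stage model: each stage must already be placed on its own GPU.
stage0 = nn.Linear(1024, 1024).to("cuda:0")
stage1 = nn.Linear(1024, 10).to("cuda:1")
model = Pipe(nn.Sequential(stage0, stage1), chunks=8)

# forward() returns an RRef; fetch its local value to get the output tensor.
x = torch.randn(64, 1024, device="cuda:0")
logits = model(x).local_value()
```

Micro-batching via the `chunks` argument is what overlaps computation across the stages; PipeTransformer's contribution is to automate, during training, how layers are packed into stages and how many pipeline replicas run as layers are progressively frozen, which this sketch does not attempt.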