Tempo: Accelerating Transformer-Based Model Training through Memory Footprint Reduction
Authors: Muralidhar Andoorveedu, Zhanda Zhu, Bojian Zheng, Gennady Pekhimenko
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We implement Tempo and evaluate the throughput, memory usage, and accuracy/loss on the BERT-Large pre-training task. We demonstrate that Tempo enables up to 2× higher batch sizes and 16% higher training throughput over the state-of-the-art baseline. |
| Researcher Affiliation | Academia | Muralidhar Andoorveedu (1), Zhanda Zhu (2,3), Bojian Zheng (1,3), Gennady Pekhimenko (1,3). (1) University of Toronto, Toronto, Canada; (2) Shanghai Jiao Tong University, Shanghai, China; (3) Vector Institute, Toronto, Canada. {andoorve, zhanda, bojian, pekhimenko}@cs.toronto.edu |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks clearly labeled as such. |
| Open Source Code | Yes | We open-source Tempo for an immediate positive impact on both machine learning researchers and practitioners here: https://github.com/UofT-EcoSystem/Tempo. |
| Open Datasets | Yes | For pre-training, we employ the NVIDIA Deep Learning Examples library [41] with the English Wikipedia dataset [67]. |
| Dataset Splits | No | The paper mentions pre-training and fine-tuning tasks but does not provide specific details on train/validation/test dataset splits, percentages, or sample counts for reproducibility. |
| Hardware Specification | Yes | Our main test setup consists of 4 NVIDIA RTX 2080 Ti GPUs [40], each with 11 GB of memory connected over PCIe v3 [47]. We also use an Amazon Web Services p3.8xlarge [3] instance consisting of 4 NVIDIA Tesla V100 GPUs [39] each with 16 GB of memory connected using NVLink [42]. For our ablation studies, we employ a system with an NVIDIA A100 GPU [43] with 40 GB of memory. |
| Software Dependencies | No | The paper mentions software such as PyTorch, the Huggingface library, and the Fairseq library, but does not provide specific version numbers for these or other key software components. |
| Experiment Setup | Yes | We perform the training in two phases, the first (i.e., longer) phase at a sequence length of 128, and the second (i.e., shorter) phase at a sequence length of 512 [12, 41]. For throughput and memory experiments, we use the BERT-Large configuration. ... profiling the Huggingface BERT-Base implementation [69] on the MRPC [13] fine-tuning task at a batch size of 32 and sequence length of 128... training for 10 epochs |
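
To make the quoted fine-tuning setup concrete, the snippet below is a minimal sketch of a comparable run: Huggingface BERT-Base on the MRPC task at batch size 32, sequence length 128, trained for 10 epochs. It uses the standard `transformers`/`datasets` APIs; the checkpoint name, output directory, and other Trainer arguments are illustrative assumptions and are not taken from the Tempo repository.

```python
# Minimal sketch (not the authors' code) of the fine-tuning setup quoted above:
# Huggingface BERT-Base on MRPC, batch size 32, sequence length 128, 10 epochs.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

# Checkpoint name is an assumption; the paper only specifies "Huggingface BERT-Base".
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=2)

raw = load_dataset("glue", "mrpc")

def tokenize(batch):
    # Pad/truncate sentence pairs to the sequence length of 128 used in the paper.
    return tokenizer(batch["sentence1"], batch["sentence2"],
                     truncation=True, padding="max_length", max_length=128)

encoded = raw.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="mrpc-bert-base",        # output path is a placeholder
    per_device_train_batch_size=32,     # batch size 32 as in the paper
    num_train_epochs=10,                # training for 10 epochs
)

trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"],
                  eval_dataset=encoded["validation"])
trainer.train()
```

This sketch only reproduces the profiling configuration described in the paper; Tempo's memory-footprint optimizations themselves are provided in the authors' open-source repository linked above.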