Tempo: Accelerating Transformer-Based Model Training through Memory Footprint Reduction

Authors: Muralidhar Andoorveedu, Zhanda Zhu, Bojian Zheng, Gennady Pekhimenko

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We implement Tempo and evaluate the throughput, memory usage, and accuracy/loss on the BERT-Large pre-training task. We demonstrate that Tempo enables up to 2× higher batch sizes and 16% higher training throughput over the state-of-the-art baseline.
Researcher Affiliation | Academia | Muralidhar Andoorveedu (1), Zhanda Zhu (2, 3), Bojian Zheng (1, 3), Gennady Pekhimenko (1, 3); (1) University of Toronto, Toronto, Canada; (2) Shanghai Jiao Tong University, Shanghai, China; (3) Vector Institute, Toronto, Canada; {andoorve, zhanda, bojian, pekhimenko}@cs.toronto.edu
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks clearly labeled as such.
Open Source Code | Yes | We open-source Tempo for an immediate positive impact on both machine learning researchers and practitioners here: https://github.com/UofT-EcoSystem/Tempo.
Open Datasets | Yes | For pre-training, we employ the NVIDIA Deep Learning Examples library [41] with the English Wikipedia dataset [67].
Dataset Splits | No | The paper mentions pre-training and fine-tuning tasks but does not provide specific details on train/validation/test dataset splits, percentages, or sample counts for reproducibility.
Hardware Specification | Yes | Our main test setup consists of 4 NVIDIA RTX 2080 Ti GPUs [40], each with 11 GB of memory connected over PCIe v3 [47]. We also use an Amazon Web Services p3.8xlarge [3] instance consisting of 4 NVIDIA Tesla V100 GPUs [39] each with 16 GB of memory connected using NVLink [42]. For our ablation studies, we employ a system with an NVIDIA A100 GPU [43] with 40 GB of memory.
Software Dependencies | No | The paper mentions software such as PyTorch, the Huggingface library, and the Fairseq library, but does not provide specific version numbers for these or other key software components.
Experiment Setup | Yes | We perform the training in two phases, the first (i.e., longer) phase at a sequence length of 128, and the second (i.e., shorter) phase at a sequence length of 512 [12, 41]. For throughput and memory experiments, we use the BERT-Large configuration. ... profiling the Huggingface BERT-Base implementation [69] on the MRPC [13] fine-tuning task at a batch size of 32 and sequence length of 128 ... training for 10 epochs.
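
For context, the MRPC fine-tuning portion of this setup maps onto the standard Huggingface transformers/datasets workflow. The indented block below is a minimal sketch under the quoted settings (BERT-Base on MRPC, batch size 32, sequence length 128, 10 epochs), not the authors' profiling script; the bert-base-uncased checkpoint name and the output directory are assumptions, and the two-phase Wikipedia pre-training with the NVIDIA Deep Learning Examples library is not reproduced here.

    # Minimal sketch of the quoted MRPC fine-tuning setup (batch size 32,
    # sequence length 128, 10 epochs). Not the authors' profiling script;
    # the "bert-base-uncased" checkpoint and output directory are assumptions.
    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    raw = load_dataset("glue", "mrpc")
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    def encode(batch):
        # Pad/truncate sentence pairs to the sequence length of 128 used for profiling.
        return tokenizer(batch["sentence1"], batch["sentence2"],
                         truncation=True, padding="max_length", max_length=128)

    encoded = raw.map(encode, batched=True)

    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)

    args = TrainingArguments(
        output_dir="mrpc-finetune",      # hypothetical output path
        per_device_train_batch_size=32,  # batch size of 32, as quoted
        num_train_epochs=10,             # training for 10 epochs
    )

    Trainer(model=model, args=args,
            train_dataset=encoded["train"],
            eval_dataset=encoded["validation"]).train()

Tempo's memory-footprint optimizations would be applied on top of a baseline of this kind; they are not shown in the sketch.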