A Multi-Level Framework for Accelerating Training Transformer Models

Authors: Longwei Zou, Han Zhang, Yangdong Deng

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments on transformer-based language models (e.g. BERT, GPT) as well as a vision model (e.g. DeiT) prove that the proposed framework reduces the computational cost by about 20% on training BERT/GPT-Base models and up to 51.6% on training the BERT-Large model while preserving the performance.
Researcher Affiliation | Academia | Longwei Zou, Han Zhang, Yangdong Deng; Tsinghua University; {zoulw22,han-zhan20}@mails.tsinghua.edu.cn, dengyd@tsinghua.edu.cn
Pseudocode | Yes | Algorithm 1 expounds the V-cycle training process. ... Algorithm 2: Coalescing Operation ... Algorithm 3: De-Coalescing Operation ... Algorithm 4: Interpolation Operation.
Open Source Code | Yes | Code is available at https://github.com/Photooon/Multi-Level-Training-Framework
Open Datasets | Yes | We use English Wikipedia and BooksCorpus (Zhu et al., 2015) as pre-training data for BERT and GPT, while DeiT is trained with ImageNet (Deng et al., 2009). For evaluation, we test the pre-trained BERT on the GLUE benchmark (Wang et al., 2019).
Dataset Splits | Yes | We use English Wikipedia and BooksCorpus (Zhu et al., 2015) as pre-training data for BERT and GPT, while DeiT is trained with ImageNet (Deng et al., 2009). For evaluation, we test the pre-trained BERT on the GLUE benchmark (Wang et al., 2019). We evaluate the pre-trained GPT on LAMBADA (Paperno et al., 2016), PTB, WikiText-2 and WikiText-103 under a zero-shot setting without fine-tuning on the training set. CIFAR10 (Krizhevsky et al., 2009), CIFAR100 (Krizhevsky et al., 2009), Flowers102 (Nilsback & Zisserman, 2008), and Stanford Cars (Krause et al., 2013) are adopted to test the downstream performance of DeiT. ... We choose a batch size of 32, a learning rate from {5e-6, 1e-5, 2e-5, 3e-5, 5e-5}, and train the model with 5 epochs on all GLUE fine-tuning tasks. We run each training process three times with random seeds for GLUE.
Hardware Specification | Yes | Our experiments are conducted on NVIDIA A100 GPUs.
Software Dependencies | No | As the mixed precision training (Micikevicius et al., 2018) and DeepSpeed framework (Rajbhandari et al., 2020) are orthogonal to our method, we use both for the pre-training of BERT and GPT.
Experiment Setup | Yes | During pre-training, we train the BERT-Base with the following settings: 40 training epochs, 10K warm-up steps, a peak learning rate of 1e-4, and a batch size of 512. We remove the next sentence prediction task (Liu et al., 2019) and use a fixed sequence length of 128. We use the same settings for BERT-Large. In the case of GPT-Base, we use 20 training epochs, 10K warm-up steps, a peak learning rate of 1e-4, and a batch size of 256. We train the DeiT-B with a peak learning rate of 1e-3, 300 training epochs and a batch size of 1024. ... We use α = 0.25 for GPT and DeiT, α = 0.5 for BERT.
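
The Pseudocode row names four algorithms (V-cycle training, coalescing, de-coalescing, interpolation) without reproducing them. The sketch below is only a minimal illustration of how such a two-level V-cycle could be wired together; the helper names (coalesce, de_coalesce, interpolate, train_steps), the blend direction in the interpolation, and the control flow are assumptions inferred from the operation names quoted above, not the authors' released code.

```python
# Hypothetical two-level V-cycle, illustrating the control flow only.
# All helpers are injected as callables so this sketch stays self-contained.

def v_cycle_train(model, coalesce, de_coalesce, interpolate, train_steps,
                  coarse_steps, fine_steps, alpha):
    """One V-cycle: coalesce -> train small model -> de-coalesce -> interpolate -> train large model."""
    small = coalesce(model)                      # cf. Algorithm 2: Coalescing Operation
    small = train_steps(small, coarse_steps)     # cheaper updates at the coarse level
    expanded = de_coalesce(small)                # cf. Algorithm 3: De-Coalescing Operation
    model = interpolate(model, expanded, alpha)  # cf. Algorithm 4; exact blending formula is an assumption
    return train_steps(model, fine_steps)        # resume training at the original model size


# Toy run with scalar "models", just to check the plumbing:
if __name__ == "__main__":
    result = v_cycle_train(
        model=1.0,
        coalesce=lambda m: m / 2,
        de_coalesce=lambda s: s * 2,
        interpolate=lambda old, new, a: a * new + (1 - a) * old,
        train_steps=lambda m, n: m + 0.01 * n,
        coarse_steps=100,
        fine_steps=100,
        alpha=0.5,
    )
    print(result)
```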
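
The Dataset Splits row also fixes the GLUE fine-tuning protocol: batch size 32, a learning rate chosen from {5e-6, 1e-5, 2e-5, 3e-5, 5e-5}, 5 epochs, and three random seeds per task. A hypothetical sweep over that grid might look as follows; finetune_on_glue is a placeholder for whatever fine-tuning routine is actually used, not a function from the released repository.

```python
# Hypothetical GLUE sweep matching the protocol quoted above.
LEARNING_RATES = [5e-6, 1e-5, 2e-5, 3e-5, 5e-5]
SEEDS = [0, 1, 2]  # "three times with random seeds"; the actual seed values are not reported

def glue_sweep(task, finetune_on_glue):
    """Return (lr, seed, score) triples for one GLUE task."""
    results = []
    for lr in LEARNING_RATES:
        for seed in SEEDS:
            score = finetune_on_glue(task, batch_size=32, lr=lr, epochs=5, seed=seed)
            results.append((lr, seed, score))
    return results
```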
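
The Software Dependencies row confirms that mixed precision training and DeepSpeed are used but names no versions or configuration, which is why the result is marked No. For orientation only, a minimal DeepSpeed setup of this kind is usually expressed as a config dictionary; everything below except the batch size (taken from the BERT settings quoted above) is an assumption, not a config published by the authors.

```python
# Illustrative DeepSpeed configuration; not a config released with the paper.
ds_config = {
    "train_batch_size": 512,            # batch size quoted for BERT pre-training
    "fp16": {"enabled": True},          # mixed precision training (Micikevicius et al., 2018)
    "zero_optimization": {"stage": 1},  # ZeRO (Rajbhandari et al., 2020); the stage is an assumption
}
# Such a dict is typically passed to deepspeed.initialize(model=..., config=ds_config).
```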
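
Finally, the Experiment Setup row lists the pre-training hyperparameters in prose. The dictionary below merely transcribes those numbers into one place for quick comparison; it is not a configuration file from the repository, and fields the row does not mention are simply omitted.

```python
# Transcription of the pre-training hyperparameters quoted above (illustrative only).
PRETRAIN_SETUP = {
    "bert_base":  {"epochs": 40,  "warmup_steps": 10_000, "peak_lr": 1e-4,
                   "batch_size": 512, "seq_len": 128, "alpha": 0.5},
    "bert_large": {"epochs": 40,  "warmup_steps": 10_000, "peak_lr": 1e-4,
                   "batch_size": 512, "seq_len": 128, "alpha": 0.5},  # "same settings" as BERT-Base
    "gpt_base":   {"epochs": 20,  "warmup_steps": 10_000, "peak_lr": 1e-4,
                   "batch_size": 256, "alpha": 0.25},
    "deit_b":     {"epochs": 300, "peak_lr": 1e-3, "batch_size": 1024, "alpha": 0.25},
}
```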