A Multi-Level Framework for Accelerating Training Transformer Models
Authors: Longwei Zou, Han Zhang, Yangdong Deng
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on transformer-based language models (e.g. BERT, GPT) as well as a vision model (e.g. DeiT) prove that the proposed framework reduces the computational cost by about 20% on training BERT/GPT-Base models and up to 51.6% on training the BERT-Large model while preserving the performance. |
| Researcher Affiliation | Academia | Longwei Zou, Han Zhang, Yangdong Deng Tsinghua University {zoulw22,han-zhan20}@mails.tsinghua.edu.cn dengyd@tsinghua.edu.cn |
| Pseudocode | Yes | Algorithm 1 expounds the V-cycle training process. ... Algorithm 2: Coalescing Operation ... Algorithm 3: De-Coalescing Operation ... Algorithm 4: Interpolation Operation. (A hedged Python sketch of one V-cycle is given below the table.) |
| Open Source Code | Yes | Code is available at https://github.com/Photooon/Multi-Level-Training-Framework |
| Open Datasets | Yes | We use English Wikipedia and Books Corpus (Zhu et al., 2015) as pre-training data for BERT and GPT, while DeiT is trained with ImageNet (Deng et al., 2009). For evaluation, we test the pre-trained BERT on the GLUE benchmark (Wang et al., 2019). |
| Dataset Splits | Yes | We use English Wikipedia and Books Corpus (Zhu et al., 2015) as pre-training data for BERT and GPT, while DeiT is trained with ImageNet (Deng et al., 2009). For evaluation, we test the pre-trained BERT on the GLUE benchmark (Wang et al., 2019). We evaluate the pre-trained GPT on LAMBADA (Paperno et al., 2016), PTB, WikiText-2 and WikiText-103 under a zero-shot setting without fine-tuning on the training set. CIFAR10 (Krizhevsky et al., 2009), CIFAR100 (Krizhevsky et al., 2009), Flowers102 (Nilsback & Zisserman, 2008), and Stanford-Cars (Krause et al., 2013) are adopted to test the downstream performance of DeiT. ... We choose a batch size of 32, a learning rate from {5e-6, 1e-5, 2e-5, 3e-5, 5e-5}, and train the model with 5 epochs on all GLUE fine-tuning tasks. We run each training process three times with random seeds for GLUE. |
| Hardware Specification | Yes | Our experiments are conducted on NVIDIA A100 GPUs. |
| Software Dependencies | No | As the mixed precision training (Micikevicius et al., 2018) and DeepSpeed framework (Rajbhandari et al., 2020) are orthogonal to our method, we use both for the pre-training of BERT and GPT. (A minimal DeepSpeed mixed-precision configuration sketch follows the table.) |
| Experiment Setup | Yes | During pre-training, we train the BERT-Base with the following settings: 40 training epochs, 10K warm-up steps, a peak learning rate of 1e-4, and a batch size of 512. We remove the next sentence prediction task (Liu et al., 2019) and use a fixed sequence length of 128. We use the same settings for BERT-Large. In the case of GPT-Base, we use 20 training epochs, 10K warm-up steps, a peak learning rate of 1e-4, and a batch size of 256. We train the DeiT-B with a peak learning rate of 1e-3, 300 training epochs and a batch size of 1024. ... We use α = 0.25 for GPT and DeiT, α = 0.5 for BERT. (The quoted pre-training and GLUE fine-tuning hyperparameters are collected in the configuration sketch below the table.) |
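
The Pseudocode row names four building blocks: a V-cycle training loop, coalescing, de-coalescing, and interpolation. The sketch below only illustrates how such a two-level cycle could be wired together; the helpers `coalesce_depth`, `de_coalesce_depth`, and `interpolate` are hypothetical stand-ins rather than the operators defined in Algorithms 2-4 of the paper, and treating α as a blending coefficient here is an assumption.

```python
# Illustrative two-level V-cycle sketch. The helpers are hypothetical stand-ins
# for the paper's coalescing / de-coalescing / interpolation operators
# (Algorithms 2-4); they only capture the overall control flow.
import copy
import torch
import torch.nn as nn


def coalesce_depth(layers: nn.ModuleList) -> nn.ModuleList:
    """Merge adjacent layers pairwise by averaging parameters (illustrative only)."""
    merged = []
    for i in range(0, len(layers), 2):
        layer = copy.deepcopy(layers[i])
        if i + 1 < len(layers):
            with torch.no_grad():
                for p, q in zip(layer.parameters(), layers[i + 1].parameters()):
                    p.copy_(0.5 * (p + q))
        merged.append(layer)
    return nn.ModuleList(merged)


def de_coalesce_depth(coarse: nn.ModuleList) -> nn.ModuleList:
    """Map each coarse layer back to two fine layers by duplication (illustrative only)."""
    return nn.ModuleList([copy.deepcopy(layer) for layer in coarse for _ in range(2)])


def interpolate(fine: nn.ModuleList, mapped: nn.ModuleList, alpha: float) -> None:
    """Blend de-coalesced parameters into the fine model (alpha assumed to be the blend weight)."""
    with torch.no_grad():
        for f, m in zip(fine, mapped):
            for p, q in zip(f.parameters(), m.parameters()):
                p.copy_((1.0 - alpha) * p + alpha * q)


def v_cycle(fine_layers: nn.ModuleList, train_fn, alpha: float = 0.5) -> None:
    """One V-cycle: coarsen, train cheaply at the coarse level, map back, blend, refine."""
    coarse = coalesce_depth(fine_layers)
    train_fn(coarse)                     # low-cost training of the small model
    mapped = de_coalesce_depth(coarse)   # project coarse parameters back up
    interpolate(fine_layers, mapped, alpha)
    train_fn(fine_layers)                # continue training the full-size model
```

Here `train_fn` stands for any ordinary training loop. The paper's framework also coalesces along the width dimension; this depth-only sketch omits that.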
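
The Software Dependencies row notes that mixed precision training and DeepSpeed are used for BERT/GPT pre-training, but quotes no versions or configuration files. The snippet below is a minimal sketch of how such a setup is commonly wired up; the ZeRO stage, optimizer settings, and placeholder model are assumptions, not the paper's configuration, and the script is meant to be run under the `deepspeed` launcher so distributed initialization is handled for us.

```python
# Minimal DeepSpeed + fp16 mixed-precision sketch (assumed settings, not the
# paper's actual configuration). Launch with the `deepspeed` CLI.
import deepspeed
import torch.nn as nn

ds_config = {
    "train_batch_size": 512,            # matches the quoted BERT batch size
    "fp16": {"enabled": True},          # mixed precision training
    "zero_optimization": {"stage": 1},  # ZeRO stage is an assumption
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

model = nn.Linear(768, 768)  # placeholder module standing in for BERT/GPT

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```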
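
For reference, the pre-training and GLUE fine-tuning hyperparameters quoted in the Dataset Splits and Experiment Setup rows are collected below as plain Python dictionaries. Only quoted values are filled in; anything not quoted (optimizer, weight decay, warm-up for DeiT, GPT sequence length) is deliberately left out.

```python
# Hyperparameters quoted in the Experiment Setup and Dataset Splits rows.
# Unquoted settings are intentionally omitted.
PRETRAIN_CONFIGS = {
    "bert-base":  dict(epochs=40,  warmup_steps=10_000, peak_lr=1e-4,
                       batch_size=512, seq_len=128, alpha=0.5),
    "bert-large": dict(epochs=40,  warmup_steps=10_000, peak_lr=1e-4,
                       batch_size=512, seq_len=128, alpha=0.5),
    "gpt-base":   dict(epochs=20,  warmup_steps=10_000, peak_lr=1e-4,
                       batch_size=256, alpha=0.25),
    "deit-b":     dict(epochs=300, peak_lr=1e-3, batch_size=1024, alpha=0.25),
}

GLUE_FINETUNE = dict(
    batch_size=32,
    learning_rates=[5e-6, 1e-5, 2e-5, 3e-5, 5e-5],  # swept per task
    epochs=5,
    num_seeds=3,  # each GLUE run is repeated with three random seeds
)
```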