Acceleration of Large Transformer Model Training by Sensitivity-Based Layer Dropping

Authors: Yujie Zeng, Wenlong He, Ihor Vasyltsov, Jiali Pang, Lin Chen

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our results show that SBLD solves the accuracy-drop issue of prior layer-dropping methods. SBLD decreases end-to-end training time by 19.67% when training the GPT-3 Medium model while increasing accuracy by 1.65% relative to the baseline. For the Swin V2-L model, the obtained Top-1 and Top-5 accuracies are also higher than the baseline. Thus, the proposed method is efficient and practical for improving large transformer model training. In this section, we demonstrate that the proposed SBLD method outperforms the existing PLD method and the no-layer-dropping baseline (B/L) on various transformer models. To evaluate the performance improvement of SBLD, we first conduct a set of experiments on the pre-training of GPT-3 models. Next, we compare the performance of our pre-trained GPT-3 model on the downstream task LAMBADA (Paperno et al. 2016). Finally, we apply SBLD in the pre-training and fine-tuning of the Swin Transformer V2 Large model. [A minimal sketch of the underlying layer-dropping mechanism follows the table.]
Researcher Affiliation | Industry | Yujie Zeng¹, Wenlong He¹, Ihor Vasyltsov², Jiali Pang¹, Lin Chen¹ (¹Samsung R&D Institute China, Xi'an; ²Samsung Advanced Institute of Technology). {yujie.zeng, wenlong.he, ihor.vasiltsov, jiali.pang, lin81.chen}@samsung.com
Pseudocode | Yes | The paper includes Algorithm 1 ("Update keep probability list") and Algorithm 2, which shows the whole workflow of SBLD. [A hedged sketch of a keep-probability update in this spirit follows the table.]
Open Source Code | No | The paper does not provide any concrete access to source code for the methodology described.
Open Datasets | Yes | We perform pre-training experiments with two GPT-3 models of different sizes (1.7 billion and 3.6 billion parameters) on 32 and 64 NVIDIA A100-80GB GPUs with the BookCorpus dataset (Zhu et al. 2019) for 10,000 steps. The downstream task LAMBADA (Paperno et al. 2016) is a benchmark dataset that evaluates the model by asking it to predict the last word of sentences after reading a paragraph of context. We conduct experiments for pre-training on ImageNet-21k (image size: 192x192) and fine-tuning on ImageNet-1k (image size: 384x384) with the Swin V2-L model on 128 and 64 GPUs, respectively.
Dataset Splits | No | The paper mentions "validation loss" and shows results for it, implying the use of a validation set, but it does not give specific details on the dataset splits (e.g., percentages or sample counts) used for training, validation, and testing.
Hardware Specification | Yes | All the experiments are performed on the DIT GPU-based supercomputer. Each server has 8 NVIDIA A100-80GB GPUs, 128 CPU cores (AMD EPYC 7543 32-Core Processor), and 1.0 TB of RAM.
Software Dependencies | Yes | The machines run the Red Hat 8.4 operating system, and the software environment includes Megatron v3.0, Python 3.8.7, CUDA 11.4, PyTorch 1.12.0, and NCCL 2.8.3.
Experiment Setup | Yes | We perform pre-training experiments with two GPT-3 models of different sizes (1.7 billion and 3.6 billion parameters) on 32 and 64 NVIDIA A100-80GB GPUs with the BookCorpus dataset (Zhu et al. 2019) for 10,000 steps. We train the GPT-3 Medium model with the PLD and SBLD methods under different θ values (0.5/0.7/0.9) for the same number of iterations. For Swin Transformer V2, we conduct pre-training on ImageNet-21k (image size: 192x192) and fine-tuning on ImageNet-1k (image size: 384x384) with the Swin V2-L model on 128 and 64 GPUs, respectively. SBLD reduces the fine-tuning time of the Swin V2-L model by 35% (3.86 h over 20 epochs vs. 5.97 h over 30 epochs) at nearly identical Top-1 accuracy (87.62% vs. 87.60%). [A back-of-envelope estimate relating θ to expected time savings follows the table.]
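
The rows above repeatedly reference layer dropping; for concreteness, below is a minimal PyTorch sketch of the stochastic layer-dropping mechanism that both PLD and SBLD build on. It is an illustration under stated assumptions, not the authors' code: the DroppableEncoder class, the keep_probs list, and the residual rescaling (borrowed from the stochastic-depth convention) are all assumptions of this sketch.

```python
import torch
import torch.nn as nn

class DroppableEncoder(nn.Module):
    """Stack of residual transformer blocks, each kept with its own probability."""

    def __init__(self, layers, keep_probs):
        super().__init__()
        assert len(layers) == len(keep_probs)
        self.layers = nn.ModuleList(layers)
        self.keep_probs = list(keep_probs)  # one keep probability per layer

    def forward(self, x):
        for layer, p in zip(self.layers, self.keep_probs):
            if self.training:
                if torch.rand(()).item() >= p:
                    continue                 # drop this layer for the step (identity shortcut)
                residual = layer(x) - x      # assumes layer(x) ~= x + f(x) (residual block)
                x = x + residual / p         # inverted-dropout rescaling (stochastic depth)
            else:
                x = layer(x)                 # keep every layer at evaluation time
        return x

# Toy usage with six standard encoder layers and layer-wise keep probabilities.
layers = [nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
          for _ in range(6)]
encoder = DroppableEncoder(layers, keep_probs=[1.0, 0.9, 0.8, 0.8, 0.9, 1.0])
encoder.train()
out = encoder(torch.randn(2, 10, 64))  # (batch, sequence, features)
```

Skipping a layer removes both its forward and backward compute for that step, which is where the end-to-end training-time savings come from.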
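This report does not reproduce Algorithm 1 itself, so the following Python sketch only illustrates what an "update keep probability list" step could look like: more sensitive layers receive higher keep probabilities, and the list is rescaled so its mean tracks a target keep rate θ. The sensitivity measure, the linear mapping, and the p_min/p_max bounds are assumptions for illustration, not the paper's exact rule.

```python
def update_keep_probs(sensitivities, theta, p_min=0.5, p_max=1.0):
    """Map per-layer sensitivity scores to keep probabilities with mean near theta.

    `sensitivities` could be, e.g., the loss increase observed when a layer is
    skipped (a hypothetical choice; the paper defines its own sensitivity metric).
    """
    lo, hi = min(sensitivities), max(sensitivities)
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    # Linearly place each layer between p_min and p_max by relative sensitivity.
    probs = [p_min + (p_max - p_min) * (s - lo) / span for s in sensitivities]
    # Rescale so the average keep probability approaches theta, capped at p_max.
    scale = theta * len(probs) / sum(probs)
    return [min(p_max, p * scale) for p in probs]

# Example: six layers, with the first and last measured as most sensitive.
print(update_keep_probs([0.9, 0.2, 0.1, 0.1, 0.3, 0.8], theta=0.7))
```

The capped rescaling means the realized mean can land slightly below θ when several layers saturate at p_max; a production implementation would iterate or redistribute the excess.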
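Finally, a back-of-envelope estimate that connects the θ values (0.5/0.7/0.9) to the reported 19.67% end-to-end reduction. This is an assumption-laden sketch, not the paper's cost model: it assumes a hypothetical fraction of step time (layer_fraction) is spent in droppable layers, with the remainder (data loading, embeddings, communication) unaffected by dropping.

```python
def expected_time_saving(theta, layer_fraction=0.8):
    """Rough expected step-time saving when the mean keep probability is theta.

    `layer_fraction` (a hypothetical value) is the share of step time spent in
    layers that can be dropped; fixed costs dilute the achievable saving.
    """
    return layer_fraction * (1.0 - theta)

for theta in (0.5, 0.7, 0.9):
    print(f"theta={theta}: ~{expected_time_saving(theta):.0%} estimated saving")
```

Under these assumptions, θ = 0.7 yields roughly 24%, in the same ballpark as the reported 19.67% once fixed per-step costs are accounted for; the point of the sketch is only that the saving scales with (1 - θ) times the droppable share of step time.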