Acceleration of Large Transformer Model Training by Sensitivity-Based Layer Dropping
Authors: Yujie Zeng, Wenlong He, Ihor Vasyltsov, Jiali Pang, Lin Chen
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results show that SBLD solves the accuracy-drop issue of prior layer dropping methods. Our SBLD method can decrease end-to-end training time by 19.67% when training the GPT-3 Medium model, while at the same time increasing accuracy by 1.65% w.r.t. the baseline. Furthermore, for the Swin V2-L model the obtained Top-1 and Top-5 accuracies are also higher vs. the baseline. Thus, the proposed method is efficient and practical for improving large transformer model training. In this section, we demonstrate that the proposed SBLD method outperforms the existing PLD method and the no-layer-dropping baseline (B/L) on various Transformer models. To evaluate the performance improvement of the SBLD method, we first conduct a set of experiments on the pre-training of GPT-3 models. Next, we compare the performance of our pre-trained GPT-3 model on the downstream task LAMBADA (Paperno et al. 2016). Finally, we apply the SBLD method to pre-training and fine-tuning of the Swin Transformer V2 Large model. |
| Researcher Affiliation | Industry | Yujie Zeng (1), Wenlong He (1), Ihor Vasyltsov (2), Jiali Pang (1), Lin Chen (1); (1) Samsung R&D Institute China Xian, (2) Samsung Advanced Institute of Technology; {yujie.zeng, wenlong.he, ihor.vasiltsov, jiali.pang, lin81.chen}@samsung.com |
| Pseudocode | Yes | Algorithm 1: Update keep probability list; Algorithm 2 shows the whole workflow of SBLD (a hedged illustrative sketch of this workflow is given after the table). |
| Open Source Code | No | The paper does not provide any concrete access to source code for the methodology described. |
| Open Datasets | Yes | We perform pre-training experiments with two GPT-3 models of different sizes, 1.7 billion and 3.6 billion parameters, on 32 and 64 NVIDIA A100-80GB GPUs with the BookCorpus dataset (Zhu et al. 2019) for 10,000 steps. The downstream task LAMBADA (Paperno et al. 2016) is a benchmark dataset that evaluates a model by asking it to predict the last word of a sentence after reading a paragraph of context. We conduct experiments for pre-training on ImageNet-21k (image size: 192x192) and fine-tuning on ImageNet-1k (image size: 384x384) with the Swin V2-L model on 128 and 64 GPUs respectively. |
| Dataset Splits | No | The paper mentions 'validation loss' and shows results for it, implying the use of a validation set, but it does not provide specific details on the dataset split (e.g., percentages or sample counts) used for training, validation, and testing. |
| Hardware Specification | Yes | All the experiments are performed on DIT GPU-based Supercomputer. Each server has 8 NVIDIA A100-80GB GPUs, 128 CPU cores (AMD EPYC 7543 32-Core Processor) and 1.0 TB of RAM. |
| Software Dependencies | Yes | The machines run the Red Hat 8.4 operating system, and the software environment includes Megatron v3.0, Python 3.8.7, CUDA 11.4, PyTorch 1.12.0 and NCCL 2.8.3. |
| Experiment Setup | Yes | We perform pre-training experiments with two GPT-3 models of different sizes, 1.7 billion and 3.6 billion parameters, on 32 and 64 NVIDIA A100-80GB GPUs with the BookCorpus dataset (Zhu et al. 2019) for 10,000 steps. We train the GPT-3 Medium model with the PLD and SBLD methods with different θ values (0.5/0.7/0.9) for the same number of iterations. For the vision experiments we use Swin Transformer V2 Large ("Swin Transformer V2: Scaling Up Capacity and Resolution"), with pre-training on ImageNet-21k (image size: 192x192) on 128 GPUs and fine-tuning on ImageNet-1k (image size: 384x384) on 64 GPUs. The SBLD method can reduce the fine-tuning time of the Swin V2-L model by 35% (3.86h with 20 epochs vs. 5.97h with 30 epochs) with only a marginal difference in Top-1 accuracy (87.62% vs. 87.60%). |
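
The Pseudocode row points to Algorithm 1 (update keep probability list) and Algorithm 2 (the overall SBLD workflow), but no source code is released (see the Open Source Code row). The PyTorch sketch below is therefore only a hedged illustration of the general idea: transformer blocks are skipped stochastically according to a per-layer keep-probability list that is periodically refreshed from per-layer sensitivity scores. The class name `DroppableTransformerStack`, the update rule in `update_keep_probs`, and the use of `min_keep_prob` in the role of the paper's θ are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class DroppableTransformerStack(nn.Module):
    """Block stack whose layers are skipped stochastically during training.

    Illustrative sketch only: the paper's sensitivity metric, its
    keep-probability update rule (Algorithm 1), and the full SBLD schedule
    (Algorithm 2) are not public, so the mapping below from sensitivity
    scores to probabilities is assumed, and `min_keep_prob` merely plays a
    role analogous to the paper's theta.
    """

    def __init__(self, blocks, min_keep_prob=0.7):
        super().__init__()
        # Each block is assumed to compute only the residual branch: x -> f(x).
        self.blocks = nn.ModuleList(blocks)
        self.min_keep_prob = min_keep_prob
        self.keep_probs = [1.0] * len(blocks)  # start by always keeping every layer

    def update_keep_probs(self, sensitivities):
        """Map per-layer sensitivity scores to keep probabilities (assumed rule).

        Layers whose removal barely changes the loss (low sensitivity) are pushed
        toward `min_keep_prob`; highly sensitive layers stay near 1.0.
        """
        max_s = max(sensitivities) + 1e-8
        self.keep_probs = [
            self.min_keep_prob + (1.0 - self.min_keep_prob) * (s / max_s)
            for s in sensitivities
        ]

    def forward(self, x):
        for block, p in zip(self.blocks, self.keep_probs):
            if self.training:
                if torch.rand(()).item() < p:
                    # Keep the layer and rescale the residual branch
                    # (stochastic-depth style) so the expected output matches
                    # the always-keep evaluation path.
                    x = x + block(x) / p
                # else: the whole block is skipped for this training step.
            else:
                x = x + block(x)
        return x


# Usage sketch: a 12-block stack of toy residual branches.
blocks = [nn.Sequential(nn.LayerNorm(256), nn.Linear(256, 256), nn.GELU())
          for _ in range(12)]
model = DroppableTransformerStack(blocks, min_keep_prob=0.7)
model.update_keep_probs(sensitivities=[1.0] * 6 + [0.2] * 6)  # hypothetical scores
model.train()
out = model(torch.randn(4, 16, 256))  # (batch, tokens, hidden)
```

Skipping a block removes both its forward and backward computation for that step, which is where layer dropping methods such as SBLD obtain the reported end-to-end training-time savings.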