DropBP: Accelerating Fine-Tuning of Large Language Models by Dropping Backward Propagation
Authors: Sunghyeon Woo, Baeseong Park, Byeongwook Kim, Minjung Jo, Se Jung Kwon, Dongsuk Jeon, Dongsoo Lee
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Specifically, DropBP can reduce training time by 44% with comparable accuracy to the baseline, accelerate convergence to the same perplexity by 1.5×, and enable training with a 6.2× larger sequence length on a single NVIDIA A100 GPU. Furthermore, our DropBP enabled a throughput increase of 79% on an NVIDIA A100 GPU and 117% on an Intel Gaudi2 HPU. |
| Researcher Affiliation | Collaboration | Sunghyeon Woo¹, Baeseong Park², Byeongwook Kim², Minjung Jo², Se Jung Kwon², Dongsuk Jeon¹, Dongsoo Lee²; ¹Seoul National University, ²NAVER Cloud |
| Pseudocode | No | The paper includes code snippets in Figure 4, labeled 'Code implementation for integrating DropBP,' which are actual Python code examples rather than pseudocode or formally labeled algorithm blocks. (A conceptual sketch of such an integration appears after this table.) |
| Open Source Code | Yes | The code is available at https://github.com/WooSunghyeon/dropbp. |
| Open Datasets | Yes | We first fine-tuned LLaMA2-7B, 13B, and 70B [8] on the Alpaca [11] and Dolly [13] datasets. ... We also fine-tuned LLaMA3-8B [9] on the Oasst1 dataset [40]... |
| Dataset Splits | Yes | Figure 5: Validation perplexity (PPL) for fine-tuning LLaMA2-70B through QLoRA (baseline) with DropBP on the Alpaca dataset. |
| Hardware Specification | Yes | Specifically, DropBP can reduce training time by 44% with comparable accuracy to the baseline, accelerate convergence to the same perplexity by 1.5×, and enable training with a 6.2× larger sequence length on a single NVIDIA A100 GPU. Furthermore, our DropBP enabled a throughput increase of 79% on an NVIDIA A100 GPU and 117% on an Intel Gaudi2 HPU. |
| Software Dependencies | No | The paper mentions developing a library in 'PyTorch [24]' and integrating code into 'LitGPT [37]' and 'Hugging Face [38]', but does not specify version numbers for these software dependencies or any other libraries. |
| Experiment Setup | Yes | In our experimental setup, the AdamW [57] optimizer and a cosine annealing learning rate scheduler [58] were utilized as common settings. LoRA [14] and QLoRA [18] were integrated into every linear layer of our model, with the LoRA parameters r and α set to 8 and 16, respectively. We experimented with all the learning rates presented in Table 9 and reported the best accuracy achieved in Table 1-2. Table 9 provides 'Detailed Setup for Table 1-2. BS and MBS are denoted as the batch size and micro batch size, respectively. Mixed refers to mixed precision training [56] using BFloat16 (BF16) and 32-bit.' (A configuration sketch reflecting these settings appears after this table.) |
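
The integration code referenced in the Pseudocode row builds on DropBP's core idea: the forward pass is computed exactly, but backward propagation through randomly selected residual blocks is skipped. Below is a minimal conceptual sketch of that idea in PyTorch. It is not the authors' `dropbp` library API; the class name `DropBPBlock`, the single fixed `drop_rate`, and the wrapping strategy are illustrative assumptions, and the paper's sensitivity-based allocation of per-layer drop rates is omitted.

```python
import torch
import torch.nn as nn


class DropBPBlock(nn.Module):
    """Conceptual sketch: skip backward propagation through a residual block.

    With probability `drop_rate`, the wrapped block is evaluated under
    torch.no_grad(), so no activations are stored and no gradients flow into
    the block during backward; gradients still reach earlier layers through
    the residual (identity) path, and the forward output is unchanged.
    """

    def __init__(self, block: nn.Module, drop_rate: float = 0.5):
        super().__init__()
        self.block = block
        self.drop_rate = drop_rate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training and torch.rand(()) < self.drop_rate:
            with torch.no_grad():
                fx = self.block(x)  # exact output, but detached from the autograd graph
            return x + fx           # backward flows only through the residual path
        return x + self.block(x)    # standard forward and backward
```

In practice each transformer layer's attention and MLP sub-blocks would be wrapped this way, and the paper additionally adjusts each layer's drop rate according to its sensitivity so that accuracy is preserved at a given average drop ratio.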
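
For the Experiment Setup row, the quoted common settings (AdamW, a cosine annealing schedule, and LoRA/QLoRA with r = 8 and α = 16 on every linear layer) can be reproduced roughly as follows with Hugging Face PEFT. This is a hedged sketch, not the paper's training script: the checkpoint name, learning rate, and scheduler horizon are placeholders rather than values reported in Table 9.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder checkpoint; the paper fine-tunes LLaMA2-7B/13B/70B and LLaMA3-8B.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=8,                           # LoRA rank r = 8, as quoted above
    lora_alpha=16,                 # LoRA alpha = 16, as quoted above
    target_modules="all-linear",   # apply LoRA to every linear layer (needs a recent PEFT release)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# AdamW with a cosine annealing schedule; lr and T_max are illustrative only.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)
```

QLoRA would additionally load the base model with 4-bit quantization, and DropBP would be applied on top of this setup as in the sketch above.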