DropBP: Accelerating Fine-Tuning of Large Language Models by Dropping Backward Propagation

Authors: Sunghyeon Woo, Baeseong Park, Byeongwook Kim, Minjung Jo, Se Jung Kwon, Dongsuk Jeon, Dongsoo Lee

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Specifically, DropBP can reduce training time by 44% with comparable accuracy to the baseline, accelerate convergence to the same perplexity by 1.5×, and enable training with a sequence length 6.2× larger on a single NVIDIA A100 GPU. Furthermore, our DropBP enabled a throughput increase of 79% on an NVIDIA A100 GPU and 117% on an Intel Gaudi2 HPU.
Researcher Affiliation | Collaboration | Sunghyeon Woo¹, Baeseong Park², Byeongwook Kim², Minjung Jo², Se Jung Kwon², Dongsuk Jeon¹, Dongsoo Lee² (¹Seoul National University, ²NAVER Cloud)
Pseudocode | No | The paper includes code snippets in Figure 4, labeled 'Code implementation for integrating DropBP,' which are actual Python code examples rather than pseudocode or formally labeled algorithm blocks (see the conceptual sketch after the table).
Open Source Code | Yes | The code is available at https://github.com/WooSunghyeon/dropbp.
Open Datasets | Yes | We first fine-tuned LLaMA2-7B, 13B, and 70B [8] on Alpaca [11] and Dolly [13] datasets. ... We also fine-tuned LLaMA3-8B [9] on the Oasst1 dataset [40]...
Dataset Splits | Yes | Figure 5: Validation perplexity (PPL) for fine-tuning LLaMA2-70B through QLoRA (baseline) with DropBP on the Alpaca dataset.
Hardware Specification | Yes | Specifically, DropBP can reduce training time by 44% with comparable accuracy to the baseline, accelerate convergence to the same perplexity by 1.5×, and enable training with a sequence length 6.2× larger on a single NVIDIA A100 GPU. Furthermore, our DropBP enabled a throughput increase of 79% on an NVIDIA A100 GPU and 117% on an Intel Gaudi2 HPU.
Software Dependencies | No | The paper mentions developing a library in 'PyTorch [24]' and integrating code into 'LitGPT [37]' and 'Hugging Face [38]', but does not specify version numbers for these software dependencies or any other libraries.
Experiment Setup | Yes | In our experimental setup, the AdamW [57] optimizer and a cosine annealing learning rate scheduler [58] were utilized as common settings. LoRA [14] and QLoRA [18] were integrated into every linear layer of our model, with the LoRA parameters r and α set to 8 and 16, respectively. We experimented with all the learning rates presented in Table 9 and reported the best accuracy achieved in Tables 1-2. Table 9 provides 'Detailed Setup for Table 1-2. BS and MBS are denoted as the batch size and micro batch size, respectively. Mixed refers to mixed precision training [56] using BFloat16 (BF16) and 32-bit.' (A minimal configuration sketch follows the table.)
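The 'Pseudocode' and 'Open Source Code' rows point to Python integration snippets (Figure 4 of the paper) and the dropbp repository. The sketch below is a minimal conceptual illustration of dropping backward propagation, not the released library's API: the wrapper class DropBPBlock, its drop_prob argument, and the toy feed-forward block are hypothetical names introduced here for illustration.

```python
import torch
import torch.nn as nn

class DropBPBlock(nn.Module):
    """Hypothetical wrapper (not the released dropbp API) illustrating the idea:
    the wrapped block's forward pass always runs, but with probability
    `drop_prob` it runs without building an autograd graph, so no activations
    are cached and no gradients are computed for it on that step. The residual
    connection still carries gradients to earlier layers."""

    def __init__(self, block: nn.Module, drop_prob: float = 0.5):
        super().__init__()
        self.block = block
        self.drop_prob = drop_prob

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training and torch.rand(()).item() < self.drop_prob:
            # Skip backward propagation (and activation caching) for this block.
            with torch.no_grad():
                out = self.block(x)
        else:
            out = self.block(x)
        return x + out

# Toy usage: wrap a feed-forward sub-block; in practice one would wrap
# transformer layers and (per the paper) assign per-layer drop rates based on
# layer sensitivity.
ffn = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
layer = DropBPBlock(ffn, drop_prob=0.5)
out = layer(torch.randn(8, 64, requires_grad=True))
out.sum().backward()  # gradients always flow through the residual path
```

The design point is that the forward computation stays exact while dropped layers contribute neither backward FLOPs nor cached activations, which is consistent with the training-time and sequence-length gains quoted in the table.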
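The 'Experiment Setup' row maps onto standard components. Below is a minimal sketch of that configuration using Hugging Face Transformers and PEFT rather than the authors' LitGPT code; only AdamW, cosine annealing, and the LoRA hyperparameters r=8 and α=16 come from the quote, while the model name, learning rate, scheduler horizon, and target-module list are illustrative assumptions (the paper's actual learning rates and batch sizes are in its Table 9).

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the base model in BF16 (the quoted setup uses BF16/32-bit mixed precision).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",   # illustrative choice of base model
    torch_dtype=torch.bfloat16,
)

# LoRA with r=8 and alpha=16 as quoted; the module list below is one reading of
# "every linear layer" for LLaMA-style blocks and is an assumption of this sketch.
lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)

# AdamW + cosine annealing, as in the quoted common settings; lr and T_max are
# placeholders, not values from the paper's Table 9.
optimizer = AdamW(model.parameters(), lr=1e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=1_000)
```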