S$^{2}$FT: Efficient, Scalable and Generalizable LLM Fine-tuning by Structured Sparsity

Authors: Xinyu Yang, Jixuan Leng, Geyang Guo, Jiawei Zhao, Ryumei Nakada, Linjun Zhang, Huaxiu Yao, Beidi Chen

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through theoretical analyses and empirical results, our method prevents forgetting while simplifying optimization, delivers SOTA performance on both commonsense and arithmetic reasoning with 4.6% and 1.3% average improvements compared to LoRA, and surpasses full FT by 11.5% when generalizing to various domains after instruction tuning. Using our partial back-propagation algorithm, S2FT saves training memory up to 3× and improves latency by 1.5-2.7× compared to full FT, while achieving an average 10% improvement over LoRA on both metrics.
Researcher Affiliation | Academia | Xinyu Yang¹, Jixuan Leng¹, Geyang Guo², Jiawei Zhao³, Ryumei Nakada⁴, Linjun Zhang⁴, Huaxiu Yao⁵, Beidi Chen¹ (¹CMU, ²Georgia Tech, ³Caltech, ⁴Rutgers, ⁵UNC-Chapel Hill)
Pseudocode | Yes |

    def setup_context(ctx, inputs, output):
        activation, weight, bias, start, end = inputs
        # only save partial input tensors for gradient calculation in forward
        ctx.save_for_backward(activation[:, start:end], weight, bias, start, end)

    def gradient_update(parameter, gradient, start, end):
        # only modify the assigned positions of weight matrices during optimization
        parameter[:, start:end].add_(gradient)
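For context, the two quoted functions correspond to hooks into PyTorch's autograd and optimizer steps. Below is a minimal, hypothetical sketch, not the authors' released implementation, of how a column-sparse linear op could save only a slice of its input using the torch.autograd.Function API with a separate setup_context (available in PyTorch 2.0+). The class name PartialLinear, the dense zero-filled weight gradient, and the toy shapes are illustrative assumptions.

    import torch

    class PartialLinear(torch.autograd.Function):
        # Hypothetical sketch, not the paper's code: a linear op that keeps only
        # the activation columns needed to train the selected weight columns.
        @staticmethod
        def forward(activation, weight, bias, start, end):
            return activation @ weight.t() + bias

        @staticmethod
        def setup_context(ctx, inputs, output):
            activation, weight, bias, start, end = inputs
            # save only the activation slice; save_for_backward accepts tensors,
            # so the integer indices are stored directly on ctx instead
            ctx.save_for_backward(activation[:, start:end], weight, bias)
            ctx.start, ctx.end = start, end

        @staticmethod
        def backward(ctx, grad_output):
            activation_slice, weight, bias = ctx.saved_tensors
            start, end = ctx.start, ctx.end
            grad_input = grad_output @ weight   # full gradient still flows to earlier layers
            # only the selected weight columns receive a gradient; a memory-saving
            # variant would keep just the slice-shaped gradient (as in the quoted
            # gradient_update) rather than a dense zero-filled tensor
            grad_weight = torch.zeros_like(weight)
            grad_weight[:, start:end] = grad_output.t() @ activation_slice
            grad_bias = grad_output.sum(dim=0)
            return grad_input, grad_weight, grad_bias, None, None

    # toy usage: only columns 4..12 of W receive non-zero gradients
    x = torch.randn(8, 16, requires_grad=True)
    W = torch.randn(32, 16, requires_grad=True)
    b = torch.zeros(32, requires_grad=True)
    PartialLinear.apply(x, W, b, 4, 12).sum().backward()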
Open Source Code | No | We have the code required to reproduce our experimental results and are working towards making our code available in a public GitHub repository.
Open Datasets | Yes | We fine-tune the LLaMA-3-8B model on the Math10K dataset [28]. The Math10K dataset combines training sets from GSM8K [14], MAWPS [32], and AQuA [37], augmented with chain-of-thought steps generated by language models. ... The commonsense reasoning dataset comprises eight subsets: BoolQ [12], PIQA [9], SocialIQA [58], HellaSwag [76], WinoGrande [57], ARC-challenge [13], ARC-easy [13], and OpenBookQA [46].
Dataset Splits | Yes | For PEFT methods, we set three ratios of trainable parameters (p = 10%, 1%, 0.1%) and search for the optimal hyperparameters on the validation set.
Hardware Specification | Yes | These numbers are measured on a single NVIDIA A100 (80 GB) SXM GPU. ... All experiments are run with 4× A100 (80 GB) GPUs. For the efficiency analysis, a single A100 GPU was used.
Software Dependencies | No | The paper mentions modifying code in PyTorch but does not specify a version number for PyTorch or any other software dependencies.
Experiment Setup | Yes | We conducted training for 3 epochs with a batch size of 64. For both PEFT methods, SpFT and LoRA, we fine-tune with three ratios of trainable parameters (p = 10%, 1%, 0.1%). ... Table 6: Hyperparameter configurations of S2FT on various base models across three tasks (Optimizer: AdamW; LR: 2e-4/1e-3/2e-5; LR Scheduler: linear/cosine; Batch size: 16/4; Warmup Steps: 100/0; Epochs: 3/1).
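As a concrete illustration of the quoted Table 6 settings, the sketch below instantiates the first option in each slot (AdamW, LR 2e-4, linear schedule, 100 warmup steps, 3 epochs) in plain PyTorch. The dummy model, the ~10k-example/batch-16 step count, and the warmup-then-linear-decay schedule are assumptions made for the example, not details from the paper.

    import torch

    # Hypothetical sketch of the Table 6 configuration (first option per slot).
    model = torch.nn.Linear(4096, 4096)               # stand-in for the fine-tuned blocks
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)

    num_epochs, steps_per_epoch, warmup_steps = 3, 625, 100   # e.g. ~10k examples / batch 16
    total_steps = num_epochs * steps_per_epoch

    def linear_warmup_decay(step):
        # linear warmup for `warmup_steps` steps, then linear decay to zero
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=linear_warmup_decay)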