S$^{2}$FT: Efficient, Scalable and Generalizable LLM Fine-tuning by Structured Sparsity
Authors: Xinyu Yang, Jixuan Leng, Geyang Guo, Jiawei Zhao, Ryumei Nakada, Linjun Zhang, Huaxiu Yao, Beidi Chen
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through theoretical analyses and empirical results, our method prevents forgetting while simplifying optimization, delivers SOTA performance on both commonsense and arithmetic reasoning with 4.6% and 1.3% average improvements compared to LoRA, and surpasses full FT by 11.5% when generalizing to various domains after instruction tuning. Using our partial back-propagation algorithm, S$^{2}$FT saves training memory by up to 3× and improves latency by 1.5-2.7× compared to full FT, while achieving an average 10% improvement over LoRA on both metrics. |
| Researcher Affiliation | Academia | Xinyu Yang¹, Jixuan Leng¹, Geyang Guo², Jiawei Zhao³, Ryumei Nakada⁴, Linjun Zhang⁴, Huaxiu Yao⁵, Beidi Chen¹ (¹CMU, ²Georgia Tech, ³Caltech, ⁴Rutgers, ⁵UNC-Chapel Hill) |
| Pseudocode | Yes | `def setup_context(ctx, inputs, output): activation, weight, bias, start, end = inputs; ctx.save_for_backward(activation[:, start:end], weight, bias, start, end)` (save only the partial input tensors needed for gradient calculation in the forward pass) and `def gradient_update(parameter, gradient, start, end): parameter[:, start:end].add_(gradient)` (modify only the assigned positions of the weight matrix during optimization); a runnable sketch of this idea follows the table. |
| Open Source Code | No | We have the code required to reproduce our experimental results and are working towards making our code available in a public GitHub repository. |
| Open Datasets | Yes | We fine-tune the LLaMA-3-8B model on the Math10K dataset [28]. The Math10K dataset combines training sets from GSM8K [14], MAWPS [32], and AQuA [37], augmented with chain-of-thought steps generated by language models. ... The commonsense reasoning dataset comprises eight subsets: BoolQ [12], PIQA [9], SocialIQA [58], HellaSwag [76], WinoGrande [57], ARC-challenge [13], ARC-easy [13], and OpenBookQA [46]. |
| Dataset Splits | Yes | For PEFT methods, we set three ratios of trainable parameters (p = 10%, 1%, 0.1%) and search for the optimal hyperparameters on the validation set. |
| Hardware Specification | Yes | These numbers are measured on a single Nvidia A100 (80G) SXM GPU. ... All experiments are run with 4 x A100 (80G). For the efficiency analysis, a single A100 GPU was used. |
| Software Dependencies | No | The paper mentions modifying code in 'PyTorch' but does not specify a version number for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | We conducted training for 3 epochs with a batch size of 64. For both PEFT methods, SpFT and LoRA, we fine-tune with three ratios of trainable parameters (p = 10%, 1%, 0.1%). ... Table 6: Hyperparameter configurations of S$^{2}$FT on various base models across three tasks (optimizer: AdamW; LR: 2e-4/1e-3/2e-5; LR scheduler: linear/cosine; batch size: 16/4; warmup steps: 100/0; epochs: 3/1). A hedged configuration example also follows the table. |
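The pseudocode quoted in the table describes two hooks: an autograd `setup_context` that saves only the activation columns needed for the trainable weight slice, and an optimizer-side `gradient_update` that writes back only that slice. The following is a minimal, self-contained PyTorch sketch of the same idea for a single linear layer. The class names, the column-block selection, and the buffer/parameter split are illustrative assumptions, not the authors' released implementation.

```python
import torch


class PartialLinearFn(torch.autograd.Function):
    """Back-propagation for y = x @ W.T + b where only W[:, start:end] is trainable."""

    @staticmethod
    def forward(ctx, x, weight, weight_slice, bias, start, end):
        out = x @ weight.t()
        if bias is not None:
            out = out + bias
        # save only the activation columns needed for the slice's gradient
        ctx.save_for_backward(x[..., start:end], weight)
        ctx.has_bias = bias is not None
        return out

    @staticmethod
    def backward(ctx, grad_out):
        x_slice, weight = ctx.saved_tensors
        g2d = grad_out.reshape(-1, grad_out.shape[-1])
        # dL/dW[:, start:end] = grad_out.T @ x[:, start:end]
        grad_w_slice = g2d.t() @ x_slice.reshape(-1, x_slice.shape[-1])
        grad_bias = g2d.sum(dim=0) if ctx.has_bias else None
        grad_x = grad_out @ weight  # input gradient still uses the full frozen weight
        # gradients for (x, weight, weight_slice, bias, start, end)
        return grad_x, None, grad_w_slice, grad_bias, None, None


class PartialLinear(torch.nn.Module):
    """Wraps a frozen nn.Linear so the optimizer only ever sees W[:, start:end]."""

    def __init__(self, linear: torch.nn.Linear, start: int, end: int):
        super().__init__()
        self.register_buffer("weight", linear.weight.detach().clone())  # frozen full W
        self.bias = (torch.nn.Parameter(linear.bias.detach().clone())
                     if linear.bias is not None else None)
        # the selected column block is the only trainable weight parameter
        self.weight_slice = torch.nn.Parameter(
            linear.weight.detach()[:, start:end].clone())
        self.start, self.end = start, end

    def forward(self, x):
        with torch.no_grad():  # keep the frozen copy in sync with the trained slice
            self.weight[:, self.start:self.end] = self.weight_slice
        return PartialLinearFn.apply(x, self.weight, self.weight_slice, self.bias,
                                     self.start, self.end)


# illustrative usage: train only 128 input columns of one 512x512 projection
layer = PartialLinear(torch.nn.Linear(512, 512), start=0, end=128)
opt = torch.optim.AdamW(layer.parameters(), lr=2e-4)
loss = layer(torch.randn(8, 512)).pow(2).mean()
loss.backward()   # saves and uses only x[:, 0:128] for the weight gradient
opt.step()        # updates only the 512x128 slice (and the bias)
```

Because only `weight_slice` is registered as a parameter, a standard optimizer reproduces the "update only the assigned positions" behavior without patching its step function, while the backward pass stores just the activation columns that the slice's gradient actually needs.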
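For reference, the hyperparameters quoted in the Experiment Setup row map onto a standard Hugging Face `TrainingArguments` configuration roughly as follows. The specific values shown (one plausible commonsense-reasoning setting) and the output path are assumptions for illustration, not the authors' training script.

```python
from transformers import TrainingArguments

# One plausible configuration matching the reported setting
# (AdamW, linear schedule, warmup of 100 steps, 3 epochs).
training_args = TrainingArguments(
    output_dir="./s2ft-llama3-8b",   # illustrative path
    optim="adamw_torch",             # AdamW optimizer
    learning_rate=2e-4,
    lr_scheduler_type="linear",
    per_device_train_batch_size=16,
    warmup_steps=100,
    num_train_epochs=3,
)
```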