LoRA-GA: Low-Rank Adaptation with Gradient Approximation

Authors: Shaowen Wang, Linxi Yu, Jian Li

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experiments demonstrate that LoRA-GA achieves a convergence rate comparable to that of full fine-tuning (hence being significantly faster than vanilla LoRA as well as various recent improvements) while simultaneously attaining comparable or even better performance.
Researcher Affiliation | Academia | Shaowen Wang (wangsw23@mails.tsinghua.edu.cn), Linxi Yu (yulx23@mails.tsinghua.edu.cn), Jian Li (lijian83@mail.tsinghua.edu.cn), Tsinghua University, Beijing, China
Pseudocode | Yes | Algorithm 1: LoRA-GA Initialization; Algorithm 2: LoRA-GA Initialization with Gradient Accumulation
Open Source Code | Yes | Code is available at code. All code for the experiments is uploaded to an anonymous GitHub repository.
Open Datasets | Yes | We fine-tune the T5-Base model on several datasets from the GLUE benchmark, including MNLI, SST-2, CoLA, QNLI, and MRPC. We train our model on a 52k subset of WizardLM [42]... We train our model on a 100k subset of MetaMathQA [43]... We train our model on a 100k subset of Code-Feedback [44]... All datasets used in our experiments are open source, which has been declared and cited in Section 1 (Introduction).
Dataset Splits | No | The paper mentions evaluating performance on a 'development set' and uses specific datasets for training and testing, but it does not explicitly provide precise percentages or sample counts for train/validation/test splits, nor does it reference standard splits with explicitly defined ratios for reproducibility.
Hardware Specification | Yes | We benchmark LoRA-GA on a single RTX 3090 24GB GPU, a 128-core CPU, and 256GB of RAM. For the experiments on T5-Base using the GLUE dataset, reported in Section 4.1, all computations were performed on a single RTX 3090. For the Llama 2-7B experiments, reported in Section 4.2, full fine-tuning and DoRA scenarios were conducted on a single A100, while all other LoRA variants and LoRA-GA were executed on a single RTX 3090.
Software Dependencies | No | The paper mentions using PyTorch and PEFT but does not provide specific version numbers for these or any other software dependencies needed for reproducibility.
Experiment Setup | Yes | We utilize prompt tuning to fine-tune the T5-Base model on the GLUE benchmark... We provide the hyperparameters in Appendix D.1. Detailed hyperparameters can be found in Appendix D.2. Each experiment is conducted with 3 different random seeds, and the average performance across these runs is reported. Training Algorithm: AdamW [49] with β1 = 0.9, β2 = 0.999, ε = 1e-8, and weight decay of 0. For full fine-tuning, LoRA, and its variants, a learning rate of 1e-4, a warmup ratio of 0.03, and cosine decay are employed. LoRA Hyperparameters: LoRA rank r = 8, α = 16. LoRA-GA Hyperparameters: γ = 16, sampled batch size sbs = 8.
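
The "Pseudocode" row above names LoRA-GA's initialization algorithms. The core idea, initializing the adapter pair from a low-rank approximation of the weight gradient so the first low-rank update tracks the full-fine-tuning direction, can be sketched as below. This is an illustrative SVD-based sketch, not the paper's exact Algorithm 1 (in particular, the paper's γ scaling and singular-vector assignment are omitted).

```python
import numpy as np

def gradient_approx_init(grad, r):
    """Illustrative gradient-approximation init for a LoRA pair (B, A).

    Hedged sketch of the idea behind LoRA-GA's initialization, not the
    paper's exact procedure: a rank-r SVD of the weight gradient makes
    the initial low-rank product B @ A the best rank-r approximation of
    the full-fine-tuning gradient direction.
    """
    U, S, Vt = np.linalg.svd(grad, full_matrices=False)
    scale = np.sqrt(S[:r])
    B = U[:, :r] * scale           # shape (d_out, r)
    A = scale[:, None] * Vt[:r]    # shape (r, d_in)
    return B, A

rng = np.random.default_rng(0)
G = rng.normal(size=(32, 16))      # stand-in for a weight gradient
B, A = gradient_approx_init(G, r=4)

# To leave the initial forward pass unchanged, the frozen weight can be
# offset by the adapter product (as nonzero-init LoRA variants do):
W = rng.normal(size=(32, 16))
W_frozen = W - B @ A               # so W_frozen + B @ A == W at step 0
```

The offset in the last line is what allows a nonzero B and A at initialization without perturbing the pretrained model's outputs.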
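
The schedule quoted in the Experiment Setup row (warmup ratio 0.03 followed by cosine decay, peak learning rate 1e-4) can be written as a plain step-to-rate function. A minimal sketch, assuming linear warmup and decay to a zero floor, neither of which the row states explicitly:

```python
import math

def lr_at_step(step, total_steps, peak_lr=1e-4, warmup_ratio=0.03):
    """Linear warmup over the first warmup_ratio of training, then
    cosine decay. The zero decay floor is an assumption."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Over 1000 steps: warmup covers the first 30 steps, the peak of 1e-4
# is reached at step 29, and the rate then decays smoothly toward zero.
```

This function would be handed to the optimizer as a per-step multiplier; the AdamW betas, epsilon, and weight decay from the row above are set separately on the optimizer itself.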