LoRA-GA: Low-Rank Adaptation with Gradient Approximation
Authors: Shaowen Wang, Linxi Yu, Jian Li
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments demonstrate that LoRA-GA achieves a convergence rate comparable to that of full fine-tuning (hence being significantly faster than vanilla LoRA as well as various recent improvements) while simultaneously attaining comparable or even better performance. |
| Researcher Affiliation | Academia | Shaowen Wang wangsw23@mails.tsinghua.edu.cn Linxi Yu yulx23@mails.tsinghua.edu.cn Jian Li lijian83@mail.tsinghua.edu.cn Tsinghua University Beijing, China |
| Pseudocode | Yes | Algorithm 1: LoRA-GA Initialization; Algorithm 2: LoRA-GA Initialization with Gradient Accumulation |
| Open Source Code | Yes | Code is available (linked in the paper). All code for the experiments is uploaded to an anonymous GitHub repository. |
| Open Datasets | Yes | We fine-tune the T5-Base model on several datasets from the GLUE benchmark, including MNLI, SST-2, CoLA, QNLI, and MRPC. We train our model on a 52k subset of WizardLM [42]... We train our model on a 100k subset of MetaMathQA [43]... We train our model on a 100k subset of Code-Feedback [44]... All datasets used in our experiments are open source, as declared and cited in Section 1 (Introduction). |
| Dataset Splits | No | The paper mentions evaluating performance on a 'development set' and uses specific datasets for training and testing, but it does not explicitly provide precise percentages or sample counts for train/validation/test splits, nor does it reference standard splits with explicitly defined ratios for reproducibility. |
| Hardware Specification | Yes | We benchmark LoRA-GA on a single RTX 3090 24GB GPU, a 128-core CPU, and 256GB of RAM. For the experiments on T5-Base using the GLUE dataset, reported in Section 4.1, all computations were performed on a single RTX 3090. For the Llama 2-7B experiments, reported in Section 4.2, full fine-tuning and DoRA scenarios were conducted on a single A100, while all other LoRA variants and LoRA-GA were executed on a single RTX 3090. |
| Software Dependencies | No | The paper mentions using PyTorch and PEFT but does not provide specific version numbers for these or any other software dependencies needed for reproducibility. |
| Experiment Setup | Yes | We utilize prompt tuning to fine-tune the T5-Base model on the GLUE benchmark... We provide the hyperparameters in Appendix D.1. Each experiment is conducted with 3 different random seeds, and the average performance is reported. Detailed hyperparameters can be found in Appendix D.2. Training Algorithm: AdamW [49] with β1 = 0.9, β2 = 0.999, ϵ = 1e-8, and weight decay of 0. For full fine-tuning, LoRA, and its variants, a learning rate of 1e-4, a warmup ratio of 0.03, and cosine decay are employed. LoRA Hyperparameters: rank r = 8, α = 16. LoRA-GA Hyperparameters: γ = 16, sampled batch size sbs = 8. |
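The setup row above fully specifies the optimizer and schedule, so it can be reconstructed directly. The sketch below is our illustration of those reported values, not the authors' code: the hyperparameter dictionaries and the `lr_at` helper (linear warmup over the first 3% of steps, then cosine decay to zero) are assumed names, and the peak learning rate, betas, epsilon, and LoRA/LoRA-GA settings are taken verbatim from the table.

```python
import math

# Hyperparameters as reported in the Experiment Setup row (paper Appendix D).
# Variable names are our own; they do not come from the authors' repository.
ADAMW = dict(lr=1e-4, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0)
LORA = dict(r=8, alpha=16)                       # vanilla LoRA settings
LORA_GA = dict(gamma=16, sampled_batch_size=8)   # LoRA-GA-specific settings


def lr_at(step, total_steps, peak_lr=1e-4, warmup_ratio=0.03):
    """Learning rate at a given step: linear warmup over the first
    `warmup_ratio` of training, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With 1000 total steps this gives 30 warmup steps: the rate rises linearly to 1e-4, then follows a half-cosine down to zero, matching the "warmup ratio of 0.03 and cosine decay" description.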