The Impact of Initialization on LoRA Finetuning Dynamics
Authors: Soufiane Hayou, Nikhil Ghosh, Bin Yu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our theoretical analysis shows that the reason behind this might be that the first initialization allows the use of larger learning rates (without causing output instability) compared to the second initialization, resulting in more efficient learning under the first scheme. We validate our results with extensive experiments on LLMs. (The two initialization schemes are sketched below the table.) |
| Researcher Affiliation | Academia | Soufiane Hayou, Simons Institute, UC Berkeley, hayou@berkeley.edu; Nikhil Ghosh, Dept. of Statistics, UC Berkeley, nikhil_ghosh@berkeley.edu; Bin Yu, Dept. of Statistics, UC Berkeley, binyu@berkeley.edu |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | We use LoRA+ code [44] for our experiments (available at https://github.com/nikhil-ghosh-berkeley/loraplus). |
| Open Datasets | Yes | The GLUE benchmark (General Language Understanding Evaluation) consists of several language tasks that evaluate the understanding capabilities of language models [8]. Using LoRA, we finetune RoBERTa-large from the RoBERTa family [12] on MNLI, SST2, and QNLI tasks with varying learning rates η and initialization schemes (Init[A] or Init[B]). |
| Dataset Splits | No | No explicit validation dataset split information (percentages, counts, or methodology for predefined splits) was found. The toy model specifies "N = 1000 (train data size), and N_test = 100 (test data size)" but does not mention a validation set. For LLM experiments, only training and test evaluation are mentioned. |
| Hardware Specification | Yes | GPUs: Nvidia A10 with 24GB VRAM. |
| Software Dependencies | No | The paper mentions using "LoRA+ code [44]" but does not specify versions for ancillary software such as programming languages, deep learning frameworks (e.g., PyTorch, TensorFlow), or other libraries. |
| Experiment Setup | Yes | Training Alg Details: Model RoBERTa-large, Learning Rates {2^k × 10^-5 for k = 0, 1, 2, ..., 10}, LR Schedule Linear with Warmup Ratio 0.06, Weight Decay 0.0, Train Batch Size 4, Number of Epochs 10. LoRA Hyperparameters: LoRA Rank 8, LoRA Dropout 0.1, Target Modules query, value. Other Hyperparameters: Sequence Length T_target = 128, Random Seeds 3, Precision FP16. (A sketch of this training configuration also follows the table.) |
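
The comparison the table summarizes is between the two standard ways of initializing the LoRA factors: Init[A] draws A from a Gaussian and sets B to zero, while Init[B] does the reverse, so the adapter update BA is zero at the start of finetuning in both cases. The minimal PyTorch sketch below illustrates the two schemes on a single linear layer; the class name `LoRALinear`, the `alpha / r` scaling, and the Gaussian standard deviations are illustrative assumptions, not the paper's or the LoRA+ repository's exact implementation.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer with a rank-r LoRA update: W x + (alpha/r) * B A x.

    Toy module for illustration only; initialization standard deviations
    and scaling are assumptions, not taken from the paper.
    """

    def __init__(self, d_in, d_out, r=8, alpha=16, init_scheme="A"):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)  # pretrained weight stays frozen

        self.A = nn.Parameter(torch.zeros(r, d_in))
        self.B = nn.Parameter(torch.zeros(d_out, r))
        self.scaling = alpha / r

        if init_scheme == "A":      # Init[A]: A random Gaussian, B = 0
            nn.init.normal_(self.A, std=1.0 / d_in ** 0.5)
        elif init_scheme == "B":    # Init[B]: B random Gaussian, A = 0
            nn.init.normal_(self.B, std=1.0 / d_out ** 0.5)
        else:
            raise ValueError("init_scheme must be 'A' or 'B'")

    def forward(self, x):
        # In both schemes B @ A = 0 at initialization, so finetuning starts
        # from the pretrained model; the schemes differ only in which factor
        # carries the random directions.
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)
```

According to the abstract quoted in the Research Type row, the scheme chosen here mainly determines how large a learning rate can be used without destabilizing the output, which is what the learning-rate sweep in the configuration sketch below probes.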
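
For concreteness, the training configuration reported in the Experiment Setup row can be expressed with Hugging Face Transformers and PEFT roughly as follows. The `lora_alpha` value, the label count, and the data/Trainer wiring are assumptions not stated in the table; the authors report running the sweep with the LoRA+ repository rather than this exact script.

```python
# Sketch of the reported sweep: RoBERTa-large with rank-8 LoRA on the
# query/value projections, learning rates 2^k * 1e-5 for k = 0..10,
# linear schedule with 6% warmup, batch size 4, 10 epochs, FP16, 3 seeds.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification, TrainingArguments

learning_rates = [2 ** k * 1e-5 for k in range(11)]  # {2^k * 10^-5, k = 0..10}

lora_config = LoraConfig(
    r=8,                               # LoRA rank from the table
    lora_alpha=16,                     # assumed; not reported above
    lora_dropout=0.1,
    target_modules=["query", "value"],
    task_type="SEQ_CLS",
)


def build_run(lr: float, seed: int):
    """Return a (model, args) pair for one point of the learning-rate sweep."""
    model = AutoModelForSequenceClassification.from_pretrained(
        "roberta-large", num_labels=3  # 3 classes for MNLI; 2 for SST-2/QNLI
    )
    model = get_peft_model(model, lora_config)  # only LoRA params stay trainable
    args = TrainingArguments(
        output_dir=f"out/lr_{lr:.0e}_seed_{seed}",
        learning_rate=lr,
        lr_scheduler_type="linear",
        warmup_ratio=0.06,
        weight_decay=0.0,
        per_device_train_batch_size=4,
        num_train_epochs=10,
        fp16=True,
        seed=seed,
    )
    # The 128-token sequence length is applied when tokenizing the GLUE
    # datasets (not shown here).
    return model, args
```

Each (learning rate, seed) pair would be trained and evaluated separately under Init[A] and Init[B]; the choice between the two schemes is not shown here, but the toy module in the previous sketch illustrates the distinction.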