The Impact of Initialization on LoRA Finetuning Dynamics

Authors: Soufiane Hayou, Nikhil Ghosh, Bin Yu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our theoretical analysis shows that the reason behind this might be that the first initialization allows the use of larger learning rates (without causing output instability) compared to the second initialization, resulting in more efficient learning of the first scheme. We validate our results with extensive experiments on LLMs." (A minimal sketch of the two initialization schemes, Init[A] and Init[B], follows the table.)
Researcher Affiliation | Academia | Soufiane Hayou (Simons Institute, UC Berkeley, hayou@berkeley.edu); Nikhil Ghosh (Dept. of Statistics, UC Berkeley, nikhil_ghosh@berkeley.edu); Bin Yu (Dept. of Statistics, UC Berkeley, binyu@berkeley.edu)
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | "We use LoRA+ code [44] for our experiments (available at https://github.com/nikhil-ghosh-berkeley/loraplus)."
Open Datasets | Yes | The GLUE benchmark (General Language Understanding Evaluation) consists of several language tasks that evaluate the understanding capabilities of language models [8]. Using LoRA, the authors finetune RoBERTa-large from the RoBERTa family [12] on the MNLI, SST2, and QNLI tasks with varying learning rates η and initialization schemes (Init[A] or Init[B]).
Dataset Splits | No | No explicit validation dataset split information (percentages, counts, or methodology for predefined splits) was found. The toy model specifies "N = 1000 (train data size), and N_test = 100 (test data size)" but does not mention a validation set. For the LLM experiments, only training and test evaluation are mentioned.
Hardware Specification | Yes | "GPUs: Nvidia A10 with 24 GB VRAM."
Software Dependencies | No | The paper mentions using "LoRA+ code [44]" but does not specify versions for ancillary software such as programming languages, deep learning frameworks (e.g., PyTorch, TensorFlow), or other libraries.
Experiment Setup | Yes | Training details: model RoBERTa-large; learning rates η ∈ {2^k × 10^-5 for k = 0, 1, 2, ..., 10}; linear LR schedule with warmup ratio 0.06; weight decay 0.0; train batch size 4; 10 epochs. LoRA hyperparameters: rank 8, dropout 0.1, target modules query and value. Other hyperparameters: sequence length T = 128, 3 random seeds, FP16 precision. (A configuration sketch also follows the table.)
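
The Init[A] and Init[B] schemes referenced above differ only in which LoRA factor starts at zero: Init[A] draws A at random and sets B = 0 (the usual LoRA default), while Init[B] draws B at random and sets A = 0. The sketch below is a minimal PyTorch illustration of that distinction, not the authors' code; the module name, the Kaiming initialization, and the alpha/rank scaling are assumptions chosen for concreteness.

```python
# Minimal sketch (assumed, not the authors' implementation) of the two LoRA
# initialization schemes compared in the paper.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, rank=8, alpha=16, init_scheme="A"):
        super().__init__()
        # Frozen pretrained weight W; only the low-rank factors A, B are trained.
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.empty(rank, d_in))
        self.lora_B = nn.Parameter(torch.empty(d_out, rank))
        self.scaling = alpha / rank  # scaling convention is an assumption

        if init_scheme == "A":
            # Init[A]: A random, B zero (standard LoRA default)
            nn.init.kaiming_uniform_(self.lora_A, a=5 ** 0.5)
            nn.init.zeros_(self.lora_B)
        elif init_scheme == "B":
            # Init[B]: B random, A zero
            nn.init.zeros_(self.lora_A)
            nn.init.kaiming_uniform_(self.lora_B, a=5 ** 0.5)
        else:
            raise ValueError(f"unknown init scheme: {init_scheme}")

    def forward(self, x):
        # y = W x + (alpha / rank) * B A x; under both schemes B A = 0 at init,
        # so the finetuned model starts from the pretrained function.
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)
```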
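For the Experiment Setup row, the sketch below shows how the reported configuration could be assembled with the Hugging Face transformers and peft libraries. This is a hedged illustration rather than the authors' LoRA+ training script; in particular, `lora_alpha`, the output directory, and the MNLI label count are assumptions not taken from the table above.

```python
# Sketch of the reported finetuning setup (RoBERTa-large, LoRA rank 8, dropout 0.1
# on query/value, linear schedule with 6% warmup, batch size 4, 10 epochs, FP16),
# assuming the transformers + peft stack.
from transformers import AutoModelForSequenceClassification, TrainingArguments
from peft import LoraConfig, get_peft_model

# Base model (MNLI has 3 labels; SST2 and QNLI would use num_labels=2).
model = AutoModelForSequenceClassification.from_pretrained("roberta-large", num_labels=3)

lora_cfg = LoraConfig(
    r=8,                                # LoRA rank reported in the paper
    lora_alpha=16,                      # assumed; not reported in the table
    lora_dropout=0.1,
    target_modules=["query", "value"],  # attention projections targeted by LoRA
    task_type="SEQ_CLS",
)
model = get_peft_model(model, lora_cfg)

# Learning-rate grid from the table: eta = 2^k * 1e-5 for k = 0, ..., 10.
learning_rates = [(2 ** k) * 1e-5 for k in range(11)]

args = TrainingArguments(
    output_dir="lora-init-mnli",        # placeholder path
    learning_rate=learning_rates[0],    # swept over the grid above
    lr_scheduler_type="linear",
    warmup_ratio=0.06,
    weight_decay=0.0,
    per_device_train_batch_size=4,
    num_train_epochs=10,
    fp16=True,
    seed=0,                             # the paper reports averaging over 3 seeds
)
```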