The Impact of Initialization on LoRA Finetuning Dynamics

Authors: Soufiane Hayou, Nikhil Ghosh, Bin Yu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our theoretical analysis shows that the reason behind this might be that the first initialization allows the use of larger learning rates (without causing output instability) compared to the second initialization, resulting in more efficient learning of the first scheme. We validate our results with extensive experiments on LLMs." (A minimal sketch of the two initialization schemes, Init[A] and Init[B], follows the table.)
Researcher Affiliation | Academia | Soufiane Hayou (Simons Institute, UC Berkeley, hayou@berkeley.edu); Nikhil Ghosh (Dept. of Statistics, UC Berkeley, nikhil_ghosh@berkeley.edu); Bin Yu (Dept. of Statistics, UC Berkeley, binyu@berkeley.edu)
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | "We use LoRA+ code [44] for our experiments (available at https://github.com/nikhil-ghosh-berkeley/loraplus)."
Open Datasets | Yes | The GLUE benchmark (General Language Understanding Evaluation) consists of several language tasks that evaluate the understanding capabilities of language models [8]. Using LoRA, the authors finetune RoBERTa-large from the RoBERTa family [12] on the MNLI, SST2, and QNLI tasks with varying learning rates η and initialization schemes (Init[A] or Init[B]).
Dataset Splits | No | No explicit validation dataset split information (percentages, counts, or methodology for predefined splits) was found. The toy model specifies "N = 1000 (train data size), and N_test = 100 (test data size)" but does not mention a validation set. For the LLM experiments, only training and test evaluation are mentioned.
Hardware Specification | Yes | "GPUs: Nvidia A10 with 24 GB VRAM."
Software Dependencies | No | The paper mentions using "LoRA+ code [44]" but does not specify versions for ancillary software such as programming languages, deep learning frameworks (e.g., PyTorch, TensorFlow), or other libraries.
Experiment Setup | Yes | Training details: model RoBERTa-large; learning rates η ∈ {2^k × 10^-5 for k = 0, 1, 2, ..., 10}; linear LR schedule with warmup ratio 0.06; weight decay 0.0; train batch size 4; 10 epochs. LoRA hyperparameters: rank 8, dropout 0.1, target modules query and value. Other hyperparameters: sequence length T = 128, 3 random seeds, FP16 precision. (A configuration sketch also follows the table.)
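
The Init[A] and Init[B] schemes referenced above differ only in which LoRA factor starts at zero: Init[A] draws A at random and sets B = 0 (the usual LoRA default), while Init[B] draws B at random and sets A = 0. The sketch below is a minimal PyTorch illustration of that distinction, not the authors' code; the module name, the Kaiming initialization, and the alpha/rank scaling are assumptions chosen for concreteness.

```python
# Minimal sketch (assumed, not the authors' implementation) of the two LoRA
# initialization schemes compared in the paper.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, rank=8, alpha=16, init_scheme="A"):
        super().__init__()
        # Frozen pretrained weight W; only the low-rank factors A, B are trained.
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.empty(rank, d_in))
        self.lora_B = nn.Parameter(torch.empty(d_out, rank))
        self.scaling = alpha / rank  # scaling convention is an assumption

        if init_scheme == "A":
            # Init[A]: A random, B zero (standard LoRA default)
            nn.init.kaiming_uniform_(self.lora_A, a=5 ** 0.5)
            nn.init.zeros_(self.lora_B)
        elif init_scheme == "B":
            # Init[B]: B random, A zero
            nn.init.zeros_(self.lora_A)
            nn.init.kaiming_uniform_(self.lora_B, a=5 ** 0.5)
        else:
            raise ValueError(f"unknown init scheme: {init_scheme}")

    def forward(self, x):
        # y = W x + (alpha / rank) * B A x; under both schemes B A = 0 at init,
        # so the finetuned model starts from the pretrained function.
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)
```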
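For the Experiment Setup row, the sketch below shows how the reported configuration could be assembled with the Hugging Face transformers and peft libraries. This is a hedged illustration rather than the authors' LoRA+ training script; in particular, `lora_alpha`, the output directory, and the MNLI label count are assumptions not taken from the table above.

```python
# Sketch of the reported finetuning setup (RoBERTa-large, LoRA rank 8, dropout 0.1
# on query/value, linear schedule with 6% warmup, batch size 4, 10 epochs, FP16),
# assuming the transformers + peft stack.
from transformers import AutoModelForSequenceClassification, TrainingArguments
from peft import LoraConfig, get_peft_model

# Base model (MNLI has 3 labels; SST2 and QNLI would use num_labels=2).
model = AutoModelForSequenceClassification.from_pretrained("roberta-large", num_labels=3)

lora_cfg = LoraConfig(
    r=8,                                # LoRA rank reported in the paper
    lora_alpha=16,                      # assumed; not reported in the table
    lora_dropout=0.1,
    target_modules=["query", "value"],  # attention projections targeted by LoRA
    task_type="SEQ_CLS",
)
model = get_peft_model(model, lora_cfg)

# Learning-rate grid from the table: eta = 2^k * 1e-5 for k = 0, ..., 10.
learning_rates = [(2 ** k) * 1e-5 for k in range(11)]

args = TrainingArguments(
    output_dir="lora-init-mnli",        # placeholder path
    learning_rate=learning_rates[0],    # swept over the grid above
    lr_scheduler_type="linear",
    warmup_ratio=0.06,
    weight_decay=0.0,
    per_device_train_batch_size=4,
    num_train_epochs=10,
    fp16=True,
    seed=0,                             # the paper reports averaging over 3 seeds
)
```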