LoRA Training in the NTK Regime has No Spurious Local Minima
Authors: Uijeong Jang, Jason D. Lee, Ernest K. Ryu
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we theoretically analyze LoRA fine-tuning and present results on trainability and generalizability. We consider fine-tuning a deep (transformer) neural network with K-dimensional outputs using N training (fine-tuning) data points. Assuming that training remains under the NTK regime, which we soon define and justify in Section 2, we show the following. First, full fine-tuning (without LoRA) admits a rank-r solution such that r(r+1)/2 ≤ KN. Second, using LoRA with rank r such that r(r+1)/2 > KN eliminates spurious local minima, allowing (stochastic) gradient descent to find the low-rank solutions. Finally, the low-rank solution found using LoRA generalizes well. [...] In this section, we conduct simple experiments on finetuning linearized pre-trained models to validate our theory. (The rank threshold implied by this condition is worked out in a short sketch after the table.) |
| Researcher Affiliation | Academia | Uijeong Jang¹, Jason D. Lee², Ernest K. Ryu³. ¹Department of Mathematical Sciences, Seoul National University; ²Department of Electrical and Computer Engineering, Princeton University; ³Department of Mathematics, University of California, Los Angeles. |
| Pseudocode | No | The paper presents mathematical proofs and theoretical analysis but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code available at https://github.com/UijeongJang/LoRA-NTK. |
| Open Datasets | Yes | We present the results of six NLP tasks that were also considered in (Malladi et al., 2023): sentiment analysis (SST-2, MR, CR), natural language inference (QNLI), subjectivity (Subj), and paraphrase detection (QQP). We use a pre-trained vision transformer (Dosovitskiy et al., 2021) and fine-tune it on the bean disease dataset (Makerere AI Lab, 2020) to perform an image classification task with 3 labels. [...] For speech classification, we use a pre-trained wav2vec2 (Baevski et al., 2020) model and fine-tune it on a SUPERB dataset (Yang et al., 2021) to perform a speech classification task with 4 labels. |
| Dataset Splits | No | The paper mentions the training data size and evaluation on a test set, but does not explicitly mention a separate validation dataset or its split proportions. For example, 'We use the empirical risk... {(X_i, Y_i)}_{i=1}^N, where N is the number of (fine-tuning) training data.' and 'evaluations on a test set of 1000 samples during training'. |
| Hardware Specification | No | The paper mentions using specific pre-trained models like 'RoBERTa-base', 'vision transformer', and 'wav2vec2' but does not specify any hardware details like GPU models, CPU types, or memory used for the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | Experimental setup on NLP tasks. We use prompt-based fine-tuning (Schick & Schütze, 2021; Gao et al., 2021) and consider the same architecture and dataset as in (Malladi et al., 2023)... We optimize a linearized RoBERTa-base (Liu et al., 2019) model with a dataset of size 32 (N = 32) with two labels (K = 2) using cross-entropy loss. [...] Additional information is in Table 1. [...] Table 1 (hyperparameters for the NLP experiments in Section 6): SST-2, QNLI: batch size 32, learning rate 0.0005 (full and LoRA fine-tuning), trained layers Wq, Wv (last layer only), weight decay 0.01. MR, CR, QQP, Subj: batch size 32, learning rate 0.001, trained layers Wq, Wv (last layer only), weight decay 0.01. (A sketch of what fine-tuning a linearized model means follows after the table.) |
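
The rank condition quoted in the Research Type row, r(r+1)/2 > KN, fixes the smallest LoRA rank the theory asks for. The snippet below is a minimal, purely illustrative sketch (the helper `min_lora_rank` is not from the paper or its repository); it simply evaluates the quoted inequality for the K = 2, N = 32 setting used in the NLP experiments.

```python
import math

def min_lora_rank(K: int, N: int) -> int:
    """Smallest integer rank r with r*(r+1)/2 > K*N, the quoted condition for
    eliminating spurious local minima. Helper name is illustrative only."""
    # Solve r^2 + r - 2*K*N = 0 and round up to the next integer.
    r = math.ceil((-1 + math.sqrt(1 + 8 * K * N)) / 2)
    # Guard the boundary case where r*(r+1)/2 equals K*N exactly.
    if r * (r + 1) // 2 <= K * N:
        r += 1
    return r

# Values from the quoted NLP setup: K = 2 labels, N = 32 fine-tuning examples.
print(min_lora_rank(2, 32))  # 11, since 11*12/2 = 66 > 64 while 10*11/2 = 55 <= 64
```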
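
The Experiment Setup row refers to optimizing a linearized RoBERTa-base model. The sketch below shows one standard way to realize that in the NTK regime: the network is replaced by its first-order Taylor expansion around the pre-trained weights, f_lin(x; θ0 + δ) = f(x; θ0) + J_f(x; θ0) δ, and only the displacement δ is trained. It assumes PyTorch 2.x with `torch.func` and a model whose forward pass returns a logits tensor; `linearized_forward`, `theta0`, and `delta` are illustrative names, and this is not the authors' released code (see the linked repository for that).

```python
import torch
from torch.func import functional_call, jvp

def linearized_forward(model, theta0, delta, x):
    """Evaluate the linearization of `model` around the pre-trained weights
    `theta0`, displaced by the trainable update `delta` (a dict keyed like theta0)."""
    def f(params):
        return functional_call(model, params, (x,))
    # jvp returns f(theta0) and the Jacobian-vector product J_f(theta0) @ delta
    # in one pass; gradients w.r.t. delta flow through the JVP term.
    out0, jvp_out = jvp(f, (theta0,), (delta,))
    return out0 + jvp_out

# Usage sketch: theta0 stays frozen and only delta is trained (e.g., restricted
# to LoRA factors on the last layer's W_q and W_v, as in the quoted setup).
# theta0 = {name: p.detach() for name, p in model.named_parameters()}
# delta  = {name: torch.zeros_like(p, requires_grad=True) for name, p in theta0.items()}
# logits = linearized_forward(model, theta0, delta, batch)
```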