Compressible Dynamics in Deep Overparameterized Low-Rank Learning & Adaptation

Authors: Can Yaras, Peng Wang, Laura Balzano, Qing Qu

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In practice, we demonstrate the effectiveness of this approach for deep low-rank matrix completion as well as fine-tuning language models. ... We validate the effectiveness of Deep LoRA on natural language tasks, particularly when fine-tuning with limited data.
Researcher Affiliation | Academia | Can Yaras¹, Peng Wang¹, Laura Balzano¹, Qing Qu¹ (¹EECS Department, University of Michigan, Ann Arbor, USA).
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code can be found at https://github.com/cjyaras/deep-lora-transformers.
Open Datasets | Yes | We first evaluate our approach on tasks in the GLUE benchmark (Wang et al., 2018)... Applying the same sampling procedure as in the prior study to the STS-B dataset (Cer et al., 2017)... we test the effectiveness of Deep LoRA compared to vanilla LoRA for few-shot fine-tuning for natural language generation (NLG), specifically on the E2E dataset (Novikova et al., 2017).
Dataset Splits | Yes | To test the performance in a limited data setting, for one given trial of a single task, we randomly sample 1024 examples from the task data for fine-tuning, and compare the difference in performance on the same train set between Deep LoRA and vanilla LoRA on the entire validation split. (A sketch of this sampling protocol follows the table.)
Hardware Specification | Yes | All experiments are carried out on a single NVIDIA Tesla V100 GPU, with time and memory usage reported in Table 2.
Software Dependencies | No | The paper mentions BERT, T5 models, the transformers library (Wolf et al., 2019), and the Adam optimizer (Kingma & Ba, 2014), but it does not provide specific version numbers for these software components or libraries.
Experiment Setup | Yes | We choose the best learning rate for each method from η ∈ {10⁻⁵, 10⁻⁴, 10⁻³, 10⁻²} on STS-B with 1024 samples, and find that η = 10⁻⁴ and α = 8 work best for vanilla LoRA, while η = 10⁻² with γ = 10⁻² works best for Deep LoRA... We use a maximum sequence length of 128 tokens for all tasks. ... We use a train batch size of 16, train all models until convergence in train loss, and use the final model checkpoint for evaluation. ... Vanilla LoRA is initialized in the same fashion as the original paper (i.e., W_k^(2) is initialized to all zeros, W_k^(1) is initialized to be Gaussian with standard deviation 1), whereas Deep LoRA is compressed from a full-width 3-layer factorization with orthogonal initialization of scale ε_l = 10⁻³.
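
The limited-data protocol quoted in the Dataset Splits row (sample 1024 training examples per trial, evaluate on the entire validation split) can be illustrated with a short sketch. This is a minimal reconstruction assuming the Hugging Face datasets library; the function name sample_limited_task, the seed, and the choice of loading STS-B via load_dataset("glue", "stsb") are assumptions, not the authors' code.

```python
# Minimal sketch of the limited-data sampling protocol (illustrative, not the authors' code).
from datasets import load_dataset

def sample_limited_task(task_name="stsb", n_train=1024, seed=0):
    """Sample a small fine-tuning set while keeping the full validation split."""
    raw = load_dataset("glue", task_name)                            # e.g. STS-B from the GLUE benchmark
    train = raw["train"].shuffle(seed=seed).select(range(n_train))   # 1024 randomly sampled examples
    validation = raw["validation"]                                   # evaluate on the entire validation split
    return train, validation

train_set, val_set = sample_limited_task()
print(len(train_set), len(val_set))  # 1024 training examples, full validation size
```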
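
The initialization contrast quoted in the Experiment Setup row can likewise be sketched. The PyTorch code below is a hedged illustration, not the authors' implementation: the matrix shapes, the square full-width inner factors, and the helper names are assumptions, and the compression of the 3-layer factorization down to a narrow one (Deep LoRA proper) is omitted.

```python
# Illustrative initializations only; shapes and helper names are assumed.
import torch

def vanilla_lora_init(d_out, d_in, r):
    """Vanilla LoRA factors: W_k^(2) starts at zero and W_k^(1) is Gaussian
    (standard deviation 1), so the initial update W_k^(2) @ W_k^(1) is zero."""
    W1 = torch.randn(r, d_in)     # W_k^(1): Gaussian, std 1
    W2 = torch.zeros(d_out, r)    # W_k^(2): all zeros
    return W1, W2

def full_width_3layer_init(d_out, d_in, eps=1e-3):
    """Full-width 3-layer factorization W3 @ W2 @ W1 with orthogonal
    initialization of scale eps (eps = 1e-3 in the quoted setup);
    square d_in x d_in inner factors are an assumption for simplicity."""
    W1 = torch.nn.init.orthogonal_(torch.empty(d_in, d_in), gain=eps)
    W2 = torch.nn.init.orthogonal_(torch.empty(d_in, d_in), gain=eps)
    W3 = torch.nn.init.orthogonal_(torch.empty(d_out, d_in), gain=eps)
    return W1, W2, W3
```

Per the paper, Deep LoRA is then obtained by compressing this full-width factorization during training; that compression step is beyond the scope of this sketch.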