$\textit{Trans-LoRA}$: towards data-free Transferable Parameter Efficient Finetuning

Authors: Runqian Wang, Soumya Ghosh, David Cox, Diego Antognini, Aude Oliva, Rogerio Feris, Leonid Karlinsky

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We perform numerous experiments confirming that our Trans-LoRA achieves the above guarantees while transferring within and across the popular Llama2 [15] and Gemma [14] model families, popular LoRA [31], DoRA [43], and Prompt Tuning [37] PEFT variants, and using a large variety of about 90 (language, code, and math) tasks contained in popular datasets such as BBH [60], MMLU [28], GSM8K [10], MBPP [5], and MBPP+ [41]."
Researcher Affiliation | Collaboration | Runqian Wang (raywang4@mit.edu), Soumya Ghosh (ghoshso@us.ibm.com), David Cox (david.d.cox@ibm.com), Diego Antognini (diego.antognini@ibm.com), Aude Oliva (oliva@mit.edu), Rogerio Feris (rsferis@us.ibm.com), Leonid Karlinsky (leonidka@ibm.com); affiliations: MIT and MIT-IBM Watson AI Lab; "Work done while at MIT-IBM Watson AI Lab".
Pseudocode | Yes | "We summarize our overall Trans-LoRA algorithm in Algorithm 1 and Figure 2."
Algorithm 1 Trans-LoRA
Require: D, θ_s, M_t, M^ϕ_disc, M_gen
  D_syn ← ∅
  while |D_syn| < |D| do
    s ← generate(M_gen, D)
    if verify(M^ϕ_disc, s) then
      D_syn ← D_syn ∪ {s}
    end if
  end while
  Initialize θ_t
  H ← CrossEntropyLoss()
  while θ_t not converged do
    L ← H(θ_t(D_syn), θ_s(D_syn))
    θ_t ← update(θ_t, L)
  end while
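The reconstructed pseudocode amounts to a two-stage loop: synthesize and filter data until the synthetic set matches the seed set in size, then distill the source adapter's outputs into a fresh adapter on the target model. Below is a minimal Python sketch of that loop, not the authors' released implementation; the callables (`generate`, `verify`, `loss_fn`, `optimizer_step`) and the model objects are assumed placeholders.

```python
# Hypothetical sketch of Algorithm 1 (Trans-LoRA): synthesize filtered data,
# then distill the source-adapter behavior into a new target-model adapter.

def trans_lora_transfer(D, source_model, target_model, generator, discriminator,
                        generate, verify, loss_fn, optimizer_step):
    # Stage 1: build a synthetic dataset at least as large as the seed data D.
    D_syn = []
    while len(D_syn) < len(D):
        s = generate(generator, D)        # propose a synthetic sample
        if verify(discriminator, s):      # keep it only if the filter accepts it
            D_syn.append(s)

    # Stage 2: train the target adapter to imitate the source adapter on D_syn
    # (knowledge distillation with a cross-entropy objective).
    converged = False
    while not converged:
        for s in D_syn:
            teacher_out = source_model(s)   # source model + trained source adapter
            student_out = target_model(s)   # target model + trainable adapter
            loss = loss_fn(student_out, teacher_out)
            converged = optimizer_step(target_model, loss)
    return target_model
```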
Open Source Code | Yes | Our code is available at https://github.com/raywang4/TransLoRA.
Open Datasets | Yes | "We have evaluated the effectiveness of our Trans-LoRA on two popular LLM families: Llama-2 [15] and Gemma [14], using 86 tasks from a large variety of topics from the following popular benchmarks: BIG-Bench Hard (BBH) [60] (27 reasoning tasks), Massive Multitask Language Understanding (MMLU) [28] (57 knowledge tasks), Mostly Basic Python Problems (MBPP) [5] (1 code task), and Grade School Math 8K (GSM8K) [10] (1 math task)."
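All of the listed benchmarks are publicly downloadable. The sketch below shows one way to fetch some of them with the Hugging Face `datasets` library; the dataset IDs and configs are assumptions and may not match how the authors' code or the LM-Eval Harness accesses them.

```python
# Hypothetical loading of the evaluation benchmarks via Hugging Face `datasets`.
# The paper evaluates through LM-Eval Harness and EvalPlus rather than loading
# these directly, so treat the IDs/configs below as illustrative only.
from datasets import load_dataset

gsm8k = load_dataset("gsm8k", "main")     # Grade School Math 8K
mbpp = load_dataset("mbpp")               # Mostly Basic Python Problems
mmlu = load_dataset("cais/mmlu", "all")   # 57 MMLU knowledge tasks
# BBH has several community mirrors on the Hub; the ID is omitted here to avoid guessing.
```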
Dataset Splits | No | The paper mentions using a "validation set" for hyperparameter search ("Hyperparameter-wise, we search the learning rate between $2\times10^{-4}$ and $2\times10^{-5}$ on the validation set using the AdamW optimizer..."), but it does not specify the size of this validation set relative to the training and test sets, nor does it provide the full data splits needed for reproduction.
Hardware Specification | Yes | "We run on 1 V100 40GB GPU per transfer task. Each task takes on average 10 hours to finish."
Software Dependencies | No | The paper mentions software components such as the "AdamW optimizer", "LM-Eval Harness [18]", and "EvalPlus [40]", but it does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | "Hyperparameter-wise, we search the learning rate between $2\times10^{-4}$ and $2\times10^{-5}$ on the validation set using the AdamW optimizer with no weight decay and a linear learning rate scheduler without warmup. We end up adopting $2\times10^{-4}$ for MMLU and $2\times10^{-5}$ for all other tasks. We use a fixed 20 epochs for BBH, MBPP, and GSM8K and 10 epochs for MMLU. We train on the default LoRA configuration (adapters built only on query and value matrices of attention block) with effective batch size 8 (gradient accumulation used for larger models)."
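As a reading aid, the quoted hyperparameters correspond roughly to a PEFT/Transformers configuration like the sketch below. The base checkpoint name, output directory, and any value not quoted above (e.g. the per-device batch size) are assumptions, not taken from the paper or its code.

```python
# Hypothetical training configuration matching the reported setup:
# LoRA on query/value projections, AdamW without weight decay,
# linear LR schedule with no warmup, effective batch size 8.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed checkpoint
lora_cfg = LoraConfig(task_type="CAUSAL_LM", target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, lora_cfg)

args = TrainingArguments(
    output_dir="translora-target",     # assumed
    learning_rate=2e-4,                # 2e-4 for MMLU, 2e-5 for all other tasks (paper)
    weight_decay=0.0,
    lr_scheduler_type="linear",
    warmup_steps=0,
    num_train_epochs=10,               # 10 for MMLU; 20 for BBH/MBPP/GSM8K (paper)
    per_device_train_batch_size=1,     # assumed
    gradient_accumulation_steps=8,     # effective batch size 8
    optim="adamw_torch",
)
```

Gradient accumulation is what keeps the effective batch size at 8 when the larger models cannot fit that batch on a single GPU, consistent with the paper's note.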