$\textit{Trans-LoRA}$: towards data-free Transferable Parameter Efficient Finetuning
Authors: Runqian Wang, Soumya Ghosh, David Cox, Diego Antognini, Aude Oliva, Rogerio Feris, Leonid Karlinsky
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform numerous experiments confirming that our Trans-LoRA achieves the above guarantees while transferring within and across the popular Llama2 [15] and Gemma [14] model families, popular LoRA [31], DoRA [43], and Prompt Tuning [37] PEFT variants, and using a large variety of about 90 (language, code, and math) tasks contained in popular datasets such as BBH [60], MMLU [28], GSM8K [10], MBPP [5], and MBPP+ [41]. |
| Researcher Affiliation | Collaboration | Runqian Wang (raywang4@mit.edu), Soumya Ghosh (ghoshso@us.ibm.com), David Cox (david.d.cox@ibm.com), Diego Antognini (diego.antognini@ibm.com), Aude Oliva (oliva@mit.edu), Rogerio Feris (rsferis@us.ibm.com), Leonid Karlinsky (leonidka@ibm.com); MIT; MIT-IBM Watson AI Lab; work done while at MIT-IBM Watson AI Lab. |
| Pseudocode | Yes | We summarize our overall Trans-LoRA algorithm in Algorithm 1 and Figure 2. Algorithm 1 (Trans-LoRA). Require: $D$, $\theta_s$, $M_t$, $M^{\phi}_{\mathrm{disc}}$. $M_{\mathrm{gen}} \leftarrow M_t$; $D_{\mathrm{syn}} \leftarrow \emptyset$; while $|D_{\mathrm{syn}}| < |D|$: $s \leftarrow \mathrm{generate}(M_{\mathrm{gen}}, D)$; if $\mathrm{verify}(M^{\phi}_{\mathrm{disc}}, s)$ then $D_{\mathrm{syn}} \leftarrow D_{\mathrm{syn}} \cup \{s\}$. Initialize $\theta_t$; $H \leftarrow \mathrm{CrossEntropyLoss}()$; while $\theta_t$ not converged: $L \leftarrow H(\theta_t(D_{\mathrm{syn}}), \theta_s(D_{\mathrm{syn}}))$; $\theta_t \leftarrow \mathrm{update}(\theta_t, L)$. (A Python sketch of this loop appears after the table.) |
| Open Source Code | Yes | Our code is available at https://github.com/raywang4/TransLoRA. |
| Open Datasets | Yes | We have evaluated the effectiveness of our Trans-LoRA on two popular LLM families: Llama-2 [15] and Gemma [14], using 86 tasks from a large variety of topics from the following popular benchmarks: BIG-Bench Hard (BBH) [60] (27 reasoning tasks), Massive Multitask Language Understanding (MMLU) [28] (57 knowledge tasks), Mostly Basic Python Problems (MBPP) [5] (1 code task), and Grade School Math 8K (GSM8K) [10] (1 math task). |
| Dataset Splits | No | The paper mentions using a "validation set" for hyperparameter search ("Hyperparameter-wise, we search the learning rate between $2\times10^{-4}$ and $2\times10^{-5}$ on the validation set using the AdamW optimizer..."), but it does not specify the size or percentage of this validation set in relation to the training and test sets, nor does it provide the full data split for reproduction. |
| Hardware Specification | Yes | We run on 1 V100 40GB GPU per transfer task. Each task takes on average 10 hours to finish. |
| Software Dependencies | No | The paper mentions software components such as the "AdamW optimizer", "LM-Eval Harness [18]", and "Evalplus [40]", but it does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | Hyperparameter-wise, we search the learning rate between $2\times10^{-4}$ and $2\times10^{-5}$ on the validation set using the AdamW optimizer with no weight decay and a linear learning rate scheduler without warmup. We end up adopting $2\times10^{-4}$ for MMLU and $2\times10^{-5}$ for all other tasks. We use a fixed 20 epochs for BBH, MBPP, and GSM8K and 10 epochs for MMLU. We train on the default LoRA configuration (adapters built only on query and value matrices of the attention block) with effective batch size 8 (gradient accumulation used for larger models). (A configuration sketch reflecting this setup also follows the table.) |
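
To make the quoted Algorithm 1 concrete, below is a minimal PyTorch sketch of the Trans-LoRA transfer loop. The helpers `generate_candidate` and `discriminator_accepts` are hypothetical stand-ins for the paper's synthetic-data generator (the target base model prompted with seed examples) and the trained LoRA discriminator filter; the soft-target cross-entropy is one plausible reading of $H(\theta_t(D_{\mathrm{syn}}), \theta_s(D_{\mathrm{syn}}))$. This is a sketch under those assumptions, not the authors' released implementation.

```python
# Hedged sketch of Algorithm 1 (Trans-LoRA): build a discriminator-filtered
# synthetic dataset, then distill the source model's predictions into the
# target model's trainable (PEFT) parameters.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader


def build_synthetic_dataset(seed_data, generate_candidate, discriminator_accepts):
    """Grow a filtered synthetic set until it matches the seed data in size
    (the |D_syn| < |D| loop of Algorithm 1)."""
    synthetic = []
    while len(synthetic) < len(seed_data):
        candidate = generate_candidate(seed_data)        # s <- generate(M_gen, D)
        if discriminator_accepts(candidate):             # verify(M_disc, s)
            synthetic.append(candidate)                  # D_syn <- D_syn U {s}
    return synthetic


def distill_target_peft(source_model, target_model, synthetic_loader,
                        lr=2e-5, epochs=20):
    """Fit the target model's trainable parameters to imitate the source
    model's output distribution on the synthetic data."""
    optimizer = torch.optim.AdamW(
        [p for p in target_model.parameters() if p.requires_grad],
        lr=lr, weight_decay=0.0,
    )
    source_model.eval()
    for _ in range(epochs):
        for batch in synthetic_loader:
            with torch.no_grad():
                teacher_probs = source_model(batch).softmax(dim=-1)          # theta_s(D_syn)
            student_log_probs = F.log_softmax(target_model(batch), dim=-1)   # theta_t(D_syn)
            loss = -(teacher_probs * student_log_probs).sum(dim=-1).mean()   # H(theta_t, theta_s)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return target_model


if __name__ == "__main__":
    # Toy run with linear stand-ins for the source/target models so the sketch
    # executes end to end; real use would pass LLMs carrying PEFT adapters.
    torch.manual_seed(0)
    source, target = nn.Linear(8, 4), nn.Linear(8, 4)
    seed_data = [torch.randn(8) for _ in range(16)]
    synthetic = build_synthetic_dataset(
        seed_data,
        generate_candidate=lambda data: torch.randn(8),  # hypothetical generator
        discriminator_accepts=lambda sample: True,       # accept-all placeholder filter
    )
    loader = DataLoader(torch.stack(synthetic), batch_size=8)
    distill_target_peft(source, target, loader, lr=1e-3, epochs=2)
```

In the paper's setting, `source_model` would be the source base model with its trained PEFT adapter and `target_model` the new base model with a freshly initialized adapter, so only the adapter weights receive gradients.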
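The reported experiment setup translates roughly into the following Hugging Face PEFT/Transformers configuration: AdamW without weight decay, a linear schedule with no warmup, LoRA on the attention query and value projections only, and effective batch size 8 via gradient accumulation. The rank/alpha values, the module names (`q_proj`/`v_proj`, as used in Llama-style checkpoints), the step count, and the example checkpoint are assumptions, not values quoted above.

```python
# Hedged sketch of the reported fine-tuning configuration; several values below
# are assumptions (see lead-in), not taken from the paper.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, get_linear_schedule_with_warmup

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # example target model

lora_cfg = LoraConfig(
    r=8,                                  # assumed default rank
    lora_alpha=16,                        # assumed default scaling
    target_modules=["q_proj", "v_proj"],  # query and value matrices only
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)

num_epochs, steps_per_epoch = 20, 500     # 20 epochs (10 for MMLU); step count assumed
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad],
    lr=2e-5,                              # 2e-4 for MMLU, 2e-5 for all other tasks
    weight_decay=0.0,
)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,                   # linear schedule without warmup
    num_training_steps=num_epochs * steps_per_epoch,
)

grad_accum_steps = 8                      # effective batch size 8 via accumulation
```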