Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Prompt Tuning Strikes Back: Customizing Foundation Models with Low-Rank Prompt Adaptation
Authors: Abhinav Jain, Swarat Chaudhuri, Thomas Reps, Christopher Jermaine
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted extensive experiments on various models, covering six benchmark NLU tasks from the GLUE dataset and three code understanding and generation tasks. Our results show that LoPA outperforms existing prompt-tuning methods and often matches the performance of full fine-tuning and LoRA. In 11 out of 24 test cases, we found LoPA outperformed LoRA. |
| Researcher Affiliation | Academia | Abhinav Jain, Department of Computer Science, Rice University; Swarat Chaudhuri, Department of Computer Science, UT Austin; Thomas Reps, Department of Computer Science, University of Wisconsin-Madison; Chris Jermaine, Department of Computer Science, Rice University |
| Pseudocode | No | The paper describes its proposed method using text, mathematical equations, and diagrams (e.g., Figure 2), but it does not include a clearly labeled 'Pseudocode' or 'Algorithm' block, nor does it present structured steps in a code-like format. |
| Open Source Code | Yes | The code for LoPA can be found here |
| Open Datasets | Yes | We evaluate LoPA on (i) six Natural Language Understanding (NLU) tasks from the GLUE benchmark [34], namely SST-2 [31], MNLI [37], MRPC [3], QNLI [29], QQP, and RTE [5]; (ii) a code-generation task that requires the model to complete method bodies from the MBPP benchmark [1]; and (iii) two code-understanding tasks, namely CruxEval-I (input prediction) and CruxEval-O (output prediction) from the CruxEval benchmark [6]. (A hedged data-loading sketch follows the table.) |
| Dataset Splits | Yes | For the GLUE tasks, we use the train-test splits pre-defined in the benchmark, while for the MBPP and CruxEval tasks, we employ a 50-50 split. Validation accuracy is also plotted in Figures 6, 7, and 8. (A hedged split sketch follows the table.) |
| Hardware Specification | Yes | All experiments are conducted on 2x NVIDIA A100 GPUs (40 GB). |
| Software Dependencies | No | The paper does not explicitly list specific version numbers for software dependencies such as programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow, or Scikit-learn versions). |
| Experiment Setup | Yes | For NLU tasks, training with FFT and LoRA was done for 10 epochs, while prompt-tuning-based approaches were trained for 20 epochs. In MBPP, all foundation model (FM) backbones were trained for 10 epochs across all tuning methods. On the CruxEval tasks, across all PEFT methods, FM backbones under 7B were trained for 20 epochs, while larger FMs (≥ 7B) were trained for 10 epochs. Lastly, training with FFT on the CruxEval tasks was done for 5 epochs. The learning rates for LoPA are set to 1 × 10^-5 in NLU and 1 × 10^-3 in coding tasks. The baseline tuning methods use the following learning rates across all tasks: FFT uses 1 × 10^-5; LoRA and the remaining soft-prompting approaches use 1 × 10^-4. (The reported settings are consolidated in a sketch below the table.) |
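As a companion to the Open Datasets row, here is a minimal sketch of loading one of the GLUE tasks the paper evaluates on. The paper does not state which tooling was used; the Hugging Face `datasets` library is assumed here purely for illustration.

```python
# Minimal sketch (assumed tooling): load SST-2, one of the six GLUE tasks
# evaluated in the paper, via the Hugging Face `datasets` library.
from datasets import load_dataset

sst2 = load_dataset("glue", "sst2")
train = sst2["train"]
# GLUE hides test labels; the validation split is the customary held-out set.
val = sst2["validation"]
print(len(train), len(val))
```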
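The Dataset Splits row reports a 50-50 split for MBPP and CruxEval but does not specify shuffling or seeding. The sketch below shows one plausible reading; the function name, seed, and shuffle step are assumptions, not details from the paper.

```python
import random
from typing import List, Sequence, Tuple

def fifty_fifty_split(examples: Sequence, seed: int = 0) -> Tuple[List, List]:
    """Hypothetical 50-50 train/test split; shuffling and seeding are assumed."""
    rng = random.Random(seed)
    indices = list(range(len(examples)))
    rng.shuffle(indices)
    half = len(indices) // 2
    train = [examples[i] for i in indices[:half]]
    test = [examples[i] for i in indices[half:]]
    return train, test
```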
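The Experiment Setup row scatters epochs and learning rates across prose. The sketch below consolidates the reported numbers into lookup tables; the dictionary names and keys are illustrative, and only the numeric values come from the paper.

```python
# Reported training epochs, keyed by (task family, tuning method).
# Key names are illustrative; the numbers are as reported in the paper.
EPOCHS = {
    ("nlu", "fft"): 10,
    ("nlu", "lora"): 10,
    ("nlu", "prompt_tuning"): 20,       # all prompt-tuning-based approaches
    ("mbpp", "all_methods"): 10,
    ("cruxeval", "peft_under_7b"): 20,
    ("cruxeval", "peft_7b_and_up"): 10,
    ("cruxeval", "fft"): 5,
}

# Reported learning rates per method (LoPA's varies by task family).
LEARNING_RATES = {
    ("nlu", "lopa"): 1e-5,
    ("coding", "lopa"): 1e-3,
    ("any", "fft"): 1e-5,
    ("any", "lora"): 1e-4,
    ("any", "soft_prompting"): 1e-4,
}
```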