Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Prompt Tuning Strikes Back: Customizing Foundation Models with Low-Rank Prompt Adaptation
Authors: Abhinav Jain, Swarat Chaudhuri, Thomas Reps, Christopher Jermaine
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted extensive experiments on various models, covering six benchmark NLU tasks from the GLUE dataset and three code understanding and generation tasks. Our results show that LoPA outperforms existing prompt-tuning methods and often matches the performance of full fine-tuning and LoRA. In 11 out of 24 test cases, we found LoPA outperformed LoRA. |
| Researcher Affiliation | Academia | Abhinav Jain, Department of Computer Science, Rice University; Swarat Chaudhuri, Department of Computer Science, UT Austin; Thomas Reps, Department of Computer Science, University of Wisconsin-Madison; Chris Jermaine, Department of Computer Science, Rice University |
| Pseudocode | No | The paper describes its proposed method using text, mathematical equations, and diagrams (e.g., Figure 2), but it does not include a clearly labeled 'Pseudocode' or 'Algorithm' block, nor does it present structured steps in a code-like format. |
| Open Source Code | Yes | The code for LoPA can be found here |
| Open Datasets | Yes | We evaluate LoPA on (i) six Natural Language Understanding (NLU) tasks from the GLUE benchmark [34], namely SST-2 [31], MNLI [37], MRPC [3], QNLI [29], QQP, and RTE [5]; (ii) a code-generation task that requires the model to complete method bodies from the MBPP benchmark [1]; and (iii) two code-understanding tasks, namely CruxEval-I (input prediction) and CruxEval-O (output prediction) from the CruxEval benchmark [6]. (A hedged data-loading sketch follows the table.) |
| Dataset Splits | Yes | For the GLUE tasks, we use the train-test splits pre-defined in the benchmark, while for the MBPP and CruxEval tasks, we employ a 50-50 split. Validation accuracy is also plotted in Figures 6, 7, and 8. (A hedged split sketch follows the table.) |
| Hardware Specification | Yes | All experiments are conducted on 2x NVIDIA A100 GPUs (40 GB). |
| Software Dependencies | No | The paper does not explicitly list specific version numbers for software dependencies such as programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow, or Scikit-learn versions). |
| Experiment Setup | Yes | For NLU tasks, training with FFT and LoRA was done for 10 epochs, while prompt-tuning-based approaches were trained for 20 epochs. In MBPP, all foundation model (FM) backbones were trained for 10 epochs across all tuning methods. On the CruxEval tasks, across all PEFT methods, FM backbones under 7B were trained for 20 epochs, while larger FMs (≥ 7B) were trained for 10 epochs. Lastly, training with FFT on the CruxEval tasks was done for 5 epochs. The learning rates for LoPA are set to 1 × 10^-5 in NLU and 1 × 10^-3 in coding tasks. The baseline tuning methods use the following learning rates across all tasks: FFT uses 1 × 10^-5; LoRA and the remaining soft-prompting approaches use 1 × 10^-4. (The reported settings are consolidated in a sketch below the table.) |
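As a companion to the Open Datasets row, here is a minimal sketch of loading one of the GLUE tasks the paper evaluates on. The paper does not state which tooling was used; the Hugging Face `datasets` library is assumed here purely for illustration.

```python
# Minimal sketch (assumed tooling): load SST-2, one of the six GLUE tasks
# evaluated in the paper, via the Hugging Face `datasets` library.
from datasets import load_dataset

sst2 = load_dataset("glue", "sst2")
train = sst2["train"]
# GLUE hides test labels; the validation split is the customary held-out set.
val = sst2["validation"]
print(len(train), len(val))
```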
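The Dataset Splits row reports a 50-50 split for MBPP and CruxEval but does not specify shuffling or seeding. The sketch below shows one plausible reading; the function name, seed, and shuffle step are assumptions, not details from the paper.

```python
import random
from typing import List, Sequence, Tuple

def fifty_fifty_split(examples: Sequence, seed: int = 0) -> Tuple[List, List]:
    """Hypothetical 50-50 train/test split; shuffling and seeding are assumed."""
    rng = random.Random(seed)
    indices = list(range(len(examples)))
    rng.shuffle(indices)
    half = len(indices) // 2
    train = [examples[i] for i in indices[:half]]
    test = [examples[i] for i in indices[half:]]
    return train, test
```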
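The Experiment Setup row scatters epochs and learning rates across prose. The sketch below consolidates the reported numbers into lookup tables; the dictionary names and keys are illustrative, and only the numeric values come from the paper.

```python
# Reported training epochs, keyed by (task family, tuning method).
# Key names are illustrative; the numbers are as reported in the paper.
EPOCHS = {
    ("nlu", "fft"): 10,
    ("nlu", "lora"): 10,
    ("nlu", "prompt_tuning"): 20,       # all prompt-tuning-based approaches
    ("mbpp", "all_methods"): 10,
    ("cruxeval", "peft_under_7b"): 20,
    ("cruxeval", "peft_7b_and_up"): 10,
    ("cruxeval", "fft"): 5,
}

# Reported learning rates per method (LoPA's varies by task family).
LEARNING_RATES = {
    ("nlu", "lopa"): 1e-5,
    ("coding", "lopa"): 1e-3,
    ("any", "fft"): 1e-5,
    ("any", "lora"): 1e-4,
    ("any", "soft_prompting"): 1e-4,
}
```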