TEMPERA: Test-Time Prompt Editing via Reinforcement Learning
Authors: Tianjun Zhang, Xuezhi Wang, Denny Zhou, Dale Schuurmans, Joseph E. Gonzalez
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that TEMPERA can achieve SoTA performance (e.g., 1.8% better on SST-2 and 3.9% better on CR) compared to few-shot finetuning, prompt tuning, and discrete prompt optimization. We also show that TEMPERA is roughly 4x more data efficient (averaged over 4 tasks: SST-2, MR, AG News, and RTE) compared with traditional finetuning methods (Figure 1). In addition, we perform extensive ablations on different aspects of the proposed algorithm. |
| Researcher Affiliation | Collaboration | Tianjun Zhang¹, Xuezhi Wang², Denny Zhou², Dale Schuurmans²,³, Joseph E. Gonzalez¹ — ¹UC Berkeley, ²Google Research, Brain Team, ³University of Alberta |
| Pseudocode | Yes | Algorithm 1: Test-Time Prompt Editing with TEMPERA. (A hedged sketch of this editing loop follows the table.) |
| Open Source Code | Yes | Our code is available at https://github.com/tianjunz/TEMPERA. |
| Open Datasets | Yes | Most of the tasks are from the standard GLUE (Wang et al., 2018). ... We test TEMPERA on few-shot text classification tasks... including single-sentence tasks (e.g., sentiment analysis including SST-2, Yelp reviews, MR, CR; topic classification including AG News). |
| Dataset Splits | Yes | We also randomly sample 16 samples per class as the validation dataset. |
| Hardware Specification | No | The paper mentions the use of "RoBERTa-large" as the language model, but it does not specify any hardware components, such as CPU or GPU models, or details about the computing environment used for the experiments. |
| Software Dependencies | No | The paper mentions software such as the "PPO algorithm", "NLTK", and "huggingface", but it does not provide specific version numbers for these or for any other software dependencies. |
| Experiment Setup | Yes | Table 8: Hyperparameters used for TEMPERA in all the tasks (e.g., steps per training 8, learning rate 0.00005, gamma 0.99). For finetuning, we use standard finetuning of the RoBERTa model from huggingface for 100 epochs, a learning rate of 0.0003, and the Adam optimizer. (A hedged configuration sketch based on these values follows the table.) |
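
The paper's Algorithm 1 trains a PPO policy that applies discrete edits (e.g., swapping in-context exemplars or changing verbalizers) to a prompt for each test query. The sketch below is a minimal, hypothetical reconstruction of that loop: the edit operations, the `score` reward, and the random stand-in policy are all illustrative assumptions, not the released implementation (see the repository linked above for the actual code).

```python
import random

# Hypothetical edit operations over a structured prompt; the real TEMPERA
# action space edits instructions, in-context exemplars, and verbalizers.
# Names and structures below are illustrative, not from the released code.
def swap_examples(prompt):
    if len(prompt["examples"]) >= 2:
        i, j = random.sample(range(len(prompt["examples"])), 2)
        prompt["examples"][i], prompt["examples"][j] = (
            prompt["examples"][j], prompt["examples"][i])
    return prompt

def change_verbalizer(prompt):
    # Cycle through candidate label words for the positive class.
    candidates = ["great", "good", "positive"]
    idx = candidates.index(prompt["verbalizer"])
    prompt["verbalizer"] = candidates[(idx + 1) % len(candidates)]
    return prompt

EDITS = [swap_examples, change_verbalizer]

def score(prompt, query):
    # Placeholder reward: TEMPERA's reward is based on the change in the
    # language model's probability of the correct label. A random score is
    # returned here only so the sketch runs without a model.
    return random.random()

def edit_prompt(prompt, query, policy, n_steps=8):
    """Test-time editing loop, one episode per test query."""
    best, best_score = prompt, score(prompt, query)
    for _ in range(n_steps):
        action = policy(best, query)  # the policy picks an edit operation
        candidate = action(dict(best, examples=list(best["examples"])))
        s = score(candidate, query)
        if s > best_score:            # keep edits that improve the reward
            best, best_score = candidate, s
    return best

if __name__ == "__main__":
    random.seed(0)
    prompt = {"instruction": "Classify the sentiment.",
              "examples": ["A: great", "B: terrible"],
              "verbalizer": "great"}
    policy = lambda p, q: random.choice(EDITS)  # stand-in for the PPO policy
    print(edit_prompt(prompt, "The movie was fun.", policy))
```

In the actual method the policy is trained with PPO rather than sampled at random, and edits are accepted according to the policy; the greedy acceptance rule above is only there so the sketch runs standalone.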
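The finetuning baseline row quotes concrete hyperparameters (RoBERTa from huggingface, 100 epochs, learning rate 0.0003, Adam). Below is a minimal sketch of that configuration using the Hugging Face `Trainer`, assuming `roberta-large` as the backbone and SST-2-style binary labels; the batch size and dataset wiring are not reported in the paper and are left as placeholders. Note that `Trainer` defaults to AdamW, used here as the closest stand-in for the reported Adam optimizer.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Reconstructed from the hyperparameters quoted above; everything not
# reported in the paper (batch size, backbone checkpoint) is an assumption.
model_name = "roberta-large"  # assumed backbone, per the hardware row above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2)  # e.g., SST-2 has two labels

args = TrainingArguments(
    output_dir="finetune-baseline",
    num_train_epochs=100,            # "for 100 epochs"
    learning_rate=3e-4,              # "a learning rate of 0.0003"
    per_device_train_batch_size=16,  # not reported; illustrative value
)

# Dataset construction is omitted; the paper uses a few-shot split with
# 16 randomly sampled examples per class for validation.
# trainer = Trainer(model=model, args=args,
#                   train_dataset=...,  # few-shot training split
#                   eval_dataset=...)   # 16 samples per class
# trainer.train()
```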