Promptbreeder: Self-Referential Self-Improvement via Prompt Evolution

Authors: Chrisantha Fernando, Dylan Sunil Banarse, Henryk Michalewski, Simon Osindero, Tim Rocktäschel

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Promptbreeder outperforms state-of-the-art prompt strategies such as Chain-of-Thought and Plan-and-Solve Prompting on commonly used arithmetic and commonsense reasoning benchmarks.
Researcher Affiliation | Industry | Google DeepMind, London. Correspondence to: Chrisantha Fernando <chrisantha@google.com>.
Pseudocode | No | The paper describes the method in prose and with diagrams, but does not include explicit pseudocode or algorithm blocks.
Open Source Code | No | The paper neither states that its source code will be released nor links to a code repository for the described methodology.
Open Datasets | Yes | We use the datasets from state-of-the-art prompt strategies such as Plan-and-Solve, spanning arithmetic reasoning with GSM8K (Cobbe et al., 2021), SVAMP (Patel et al., 2021), MultiArith (Roy & Roth, 2016), AddSub (Hosseini et al., 2014), AQuA-RAT (Ling et al., 2017), and SingleEq (Koncel-Kedziorski et al., 2015), commonsense reasoning with CommonsenseQA (CSQA, Talmor et al., 2019) and StrategyQA (SQA, Geva et al., 2021), instruction induction tasks from Honovich et al. (2023), and hate speech classification on the ETHOS dataset (Mollas et al., 2022).
Dataset Splits | Yes | To evaluate the fitness of each evolved task-prompt, we sample a batch of 100 Q&A pairs from the entire training set of the domain at hand. ... Where datasets were not provided with a training/test split (MultiArith, AddSub, SingleEq and SVAMP) the dataset was split into two equal training and test sets before the experiments were conducted. (A minimal sketch of this procedure appears after the table.)
Hardware Specification | No | The paper mentions running '8-16 LLM models at once' and reports 'GPT 3.5 results', but does not specify the underlying hardware (e.g., CPU or GPU models, memory) used for these computations.
Software Dependencies | No | The paper mentions using PaLM 2-L, GPT3.5-Turbo-0613, GPT3.5-Turbo-1106, and BERT, but does not provide specific version numbers for these or for any other software dependencies.
Experiment Setup | Yes | We used a population size of 50 units, evolved for typically 20-30 generations... To evaluate the fitness of each evolved task-prompt, we sample a batch of 100 Q&A pairs... The maximum number of tokens sampled under each context was 50, 30 and 5 respectively. The temperature of the Inducer and Evaluator was set to 0.0 in all cases, but the temperature of the Redescriber was initialized from 1.0 to 2.0 and permitted to evolve... (A sketch of this evolutionary loop follows the table.)
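Since the paper releases no code, the following is a minimal sketch of the dataset-split and fitness-batch procedure quoted in the Dataset Splits row, assuming each dataset is a plain list of (question, answer) pairs. The function names, the shuffle-before-split step, and the seed handling are illustrative assumptions, not the authors' implementation.

```python
import random

def split_train_test(examples, seed=0):
    # Datasets shipped without an official split (MultiArith, AddSub,
    # SingleEq, SVAMP) were divided into two equal halves per the paper.
    # Whether the authors shuffled before splitting is not stated; we
    # shuffle here so the halves are not order-dependent (an assumption).
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

def sample_fitness_batch(train_set, batch_size=100, rng=random):
    # Fitness of one evolved task-prompt is measured on a batch of
    # 100 Q&A pairs sampled from the entire training set (per the paper).
    return rng.sample(train_set, min(batch_size, len(train_set)))
```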
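For orientation, here is a hedged sketch of the evolutionary loop the Experiment Setup row describes; Promptbreeder is built on a binary tournament genetic algorithm. Everything concrete below (the llm_answer and mutate_prompt callables, and how a "generation" maps onto tournament rounds) is an assumption for illustration, not the authors' code.

```python
import random

POPULATION_SIZE = 50   # per the paper
GENERATIONS = 30       # paper reports typically 20-30

def evaluate_fitness(task_prompt, train_set, llm_answer, batch_size=100):
    # Accuracy of the prompt on a fresh random batch of 100 training
    # Q&A pairs (per the paper). llm_answer is a hypothetical callable
    # that runs the LLM on (prompt, question) and returns an answer.
    batch = random.sample(train_set, min(batch_size, len(train_set)))
    return sum(llm_answer(task_prompt, q) == a for q, a in batch) / len(batch)

def evolve(initial_population, train_set, llm_answer, mutate_prompt):
    # Binary tournament: pick two units, and the fitter one overwrites
    # the loser with a mutated copy of itself. mutate_prompt stands in
    # for Promptbreeder's LLM-driven mutation operators. Treating one
    # "generation" as |population| tournaments is our assumption.
    population = list(initial_population)
    for _ in range(GENERATIONS * len(population)):
        i, j = random.sample(range(len(population)), 2)
        fi = evaluate_fitness(population[i], train_set, llm_answer)
        fj = evaluate_fitness(population[j], train_set, llm_answer)
        winner, loser = (i, j) if fi >= fj else (j, i)
        population[loser] = mutate_prompt(population[winner])
    return population
```

In this mapping, the quoted temperatures would presumably live inside the two callables: 0.0 for the Inducer and Evaluator contexts behind llm_answer, and the evolvable 1.0-2.0 Redescriber temperature inside mutate_prompt.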