Promptbreeder: Self-Referential Self-Improvement via Prompt Evolution
Authors: Chrisantha Fernando, Dylan Sunil Banarse, Henryk Michalewski, Simon Osindero, Tim Rocktäschel
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Promptbreeder outperforms state-of-the-art prompt strategies such as Chain-of-Thought and Plan-and-Solve Prompting on commonly used arithmetic and commonsense reasoning benchmarks. |
| Researcher Affiliation | Industry | Google DeepMind, London. Correspondence to: Chrisantha Fernando <chrisantha@google.com>. |
| Pseudocode | No | The paper describes the method in prose and with diagrams, but does not include explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about the release of its source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | We use the datasets from state-of-the-art prompt strategies such as Plan-and-Solve, spanning arithmetic reasoning with GSM8K (Cobbe et al., 2021), SVAMP (Patel et al., 2021), MultiArith (Roy & Roth, 2016), AddSub (Hosseini et al., 2014), AQuA-RAT (Ling et al., 2017), and SingleEq (Koncel-Kedziorski et al., 2015), commonsense reasoning with CommonsenseQA (CSQA, Talmor et al., 2019) and StrategyQA (SQA, Geva et al., 2021), instruction induction tasks from (Honovich et al., 2023), and hate speech classification on the ETHOS dataset (Mollas et al., 2022). |
| Dataset Splits | Yes | To evaluate the fitness of each evolved task-prompt, we sample a batch of 100 Q&A pairs from the entire training set of the domain at hand. ... Where datasets were not provided with a training/test split (MultiArith, AddSub, SingleEq and SVAMP) the dataset was split into two equal training and test sets before the experiments were conducted. (A minimal sketch of this split-and-score protocol appears below the table.) |
| Hardware Specification | No | The paper mentions running '8-16 LLM models at once' and discusses 'GPT 3.5 results', but does not specify the underlying hardware (e.g., CPU or GPU models, memory) used for these computations. |
| Software Dependencies | No | The paper names the models used (PaLM 2-L, GPT3.5-Turbo-0613, GPT3.5-Turbo-1106, and BERT), but does not specify versions for any supporting software libraries or frameworks. |
| Experiment Setup | Yes | We used a population size of 50 units, evolved for typically 20-30 generations... To evaluate the fitness of each evolved task-prompt, we sample a batch of 100 Q&A pairs... The maximum number of tokens sampled under each context was 50, 30 and 5 respectively. The temperature of the Inducer and Evaluator was set to 0.0 in all cases, but the temperature of the Redescriber was initialized from 1.0 to 2.0 and permitted to evolve... (An illustrative evolution-loop sketch based on these settings appears below the table.) |
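
The split-and-score protocol quoted in the Dataset Splits row is concrete enough to illustrate. The sketch below is not the authors' code (none is released): it assumes each dataset is a list of `(question, answer)` pairs, and `split_dataset`, `fitness`, and `answer_fn` are hypothetical names, with `answer_fn` standing in for whatever LLM call produces a parsed answer.

```python
import random

def split_dataset(examples, seed=0):
    """50/50 train/test split for datasets shipped without an official
    split (MultiArith, AddSub, SingleEq, SVAMP), per the quoted setup."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

def fitness(task_prompt, train_set, answer_fn, batch_size=100, rng=None):
    """Estimate a task-prompt's fitness as accuracy on a random batch of
    100 Q&A pairs drawn from the training set. `answer_fn(prompt, question)`
    is a stand-in for an LLM call that returns a parsed answer."""
    rng = rng or random.Random()
    batch = rng.sample(train_set, min(batch_size, len(train_set)))
    correct = sum(answer_fn(task_prompt, q) == a for q, a in batch)
    return correct / len(batch)
```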
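Since the paper includes no pseudocode (per the Pseudocode row), the following is one plausible reconstruction of the quoted Experiment Setup, not the authors' algorithm: a population of 50 units, each pairing a task-prompt with a mutation-prompt, evolved by binary tournament. The `Unit` fields beyond those quoted, the tournament scheme, and the replacement rule are assumptions; `evaluate` and `mutate` are hypothetical callables.

```python
import copy
import random
from dataclasses import dataclass

POPULATION_SIZE = 50   # "a population size of 50 units"
GENERATIONS = 30       # "typically 20-30 generations"

@dataclass(eq=False)   # identity comparison so list.index finds the exact unit
class Unit:
    task_prompt: str
    mutation_prompt: str
    # "the temperature of the Redescriber was initialized from 1.0 to 2.0
    # and permitted to evolve"; the Inducer and Evaluator run at 0.0.
    redescriber_temp: float = 1.5
    fitness: float = 0.0

def evolve(population, evaluate, mutate, generations=GENERATIONS, seed=0):
    """One plausible binary-tournament loop. `evaluate(unit) -> float`
    scores a unit (e.g., accuracy on a 100-example batch, as above);
    `mutate(unit) -> Unit` would call the LLM with the unit's
    mutation-prompt to rewrite its task-prompt. Both are stand-ins."""
    rng = random.Random(seed)
    for unit in population:
        unit.fitness = evaluate(unit)
    for _ in range(generations):
        for _ in range(len(population) // 2):
            a, b = rng.sample(population, 2)
            winner, loser = (a, b) if a.fitness >= b.fitness else (b, a)
            child = mutate(copy.deepcopy(winner))  # loser is overwritten by a mutant of the winner
            child.fitness = evaluate(child)
            population[population.index(loser)] = child
    return max(population, key=lambda u: u.fitness)
```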