Promptbreeder: Self-Referential Self-Improvement via Prompt Evolution

Authors: Chrisantha Fernando, Dylan Sunil Banarse, Henryk Michalewski, Simon Osindero, Tim Rocktäschel

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Promptbreeder outperforms state-of-the-art prompt strategies such as Chain-of-Thought and Plan-and-Solve Prompting on commonly used arithmetic and commonsense reasoning benchmarks.
Researcher Affiliation | Industry | Google DeepMind, London. Correspondence to: Chrisantha Fernando <chrisantha@google.com>.
Pseudocode | No | The paper describes the method in prose and with diagrams, but does not include explicit pseudocode or algorithm blocks.
Open Source Code | No | The paper neither states that its source code will be released nor links to a code repository for the described methodology.
Open Datasets | Yes | We use the datasets from state-of-the-art prompt strategies such as Plan-and-Solve, spanning arithmetic reasoning with GSM8K (Cobbe et al., 2021), SVAMP (Patel et al., 2021), MultiArith (Roy & Roth, 2016), AddSub (Hosseini et al., 2014), AQuA-RAT (Ling et al., 2017), and SingleEq (Koncel-Kedziorski et al., 2015), commonsense reasoning with CommonsenseQA (CSQA, Talmor et al., 2019) and StrategyQA (SQA, Geva et al., 2021), instruction induction tasks from Honovich et al. (2023), and hate speech classification on the ETHOS dataset (Mollas et al., 2022).
Dataset Splits | Yes | To evaluate the fitness of each evolved task-prompt, we sample a batch of 100 Q&A pairs from the entire training set of the domain at hand. ... Where datasets were not provided with a training/test split (MultiArith, AddSub, SingleEq and SVAMP) the dataset was split into two equal training and test sets before the experiments were conducted. (A minimal sketch of this procedure appears after the table.)
Hardware Specification | No | The paper mentions running '8-16 LLM models at once' and reports 'GPT 3.5 results', but does not specify the underlying hardware (e.g., CPU or GPU models, memory) used for these computations.
Software Dependencies | No | The paper mentions using PaLM 2-L, GPT3.5-Turbo-0613, GPT3.5-Turbo-1106, and BERT, but does not provide specific version numbers for these or for any other software dependencies.
Experiment Setup | Yes | We used a population size of 50 units, evolved for typically 20-30 generations... To evaluate the fitness of each evolved task-prompt, we sample a batch of 100 Q&A pairs... The maximum number of tokens sampled under each context was 50, 30 and 5 respectively. The temperature of the Inducer and Evaluator was set to 0.0 in all cases, but the temperature of the Redescriber was initialized from 1.0 to 2.0 and permitted to evolve... (A sketch of this evolutionary loop follows the table.)
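Since the paper releases no code, the following is a minimal sketch of the dataset-split and fitness-batch procedure quoted in the Dataset Splits row, assuming each dataset is a plain list of (question, answer) pairs. The function names, the shuffle-before-split step, and the seed handling are illustrative assumptions, not the authors' implementation.

```python
import random

def split_train_test(examples, seed=0):
    # Datasets shipped without an official split (MultiArith, AddSub,
    # SingleEq, SVAMP) were divided into two equal halves per the paper.
    # Whether the authors shuffled before splitting is not stated; we
    # shuffle here so the halves are not order-dependent (an assumption).
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

def sample_fitness_batch(train_set, batch_size=100, rng=random):
    # Fitness of one evolved task-prompt is measured on a batch of
    # 100 Q&A pairs sampled from the entire training set (per the paper).
    return rng.sample(train_set, min(batch_size, len(train_set)))
```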
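For orientation, here is a hedged sketch of the evolutionary loop the Experiment Setup row describes; Promptbreeder is built on a binary tournament genetic algorithm. Everything concrete below (the llm_answer and mutate_prompt callables, and how a "generation" maps onto tournament rounds) is an assumption for illustration, not the authors' code.

```python
import random

POPULATION_SIZE = 50   # per the paper
GENERATIONS = 30       # paper reports typically 20-30

def evaluate_fitness(task_prompt, train_set, llm_answer, batch_size=100):
    # Accuracy of the prompt on a fresh random batch of 100 training
    # Q&A pairs (per the paper). llm_answer is a hypothetical callable
    # that runs the LLM on (prompt, question) and returns an answer.
    batch = random.sample(train_set, min(batch_size, len(train_set)))
    return sum(llm_answer(task_prompt, q) == a for q, a in batch) / len(batch)

def evolve(initial_population, train_set, llm_answer, mutate_prompt):
    # Binary tournament: pick two units, and the fitter one overwrites
    # the loser with a mutated copy of itself. mutate_prompt stands in
    # for Promptbreeder's LLM-driven mutation operators. Treating one
    # "generation" as |population| tournaments is our assumption.
    population = list(initial_population)
    for _ in range(GENERATIONS * len(population)):
        i, j = random.sample(range(len(population)), 2)
        fi = evaluate_fitness(population[i], train_set, llm_answer)
        fj = evaluate_fitness(population[j], train_set, llm_answer)
        winner, loser = (i, j) if fi >= fj else (j, i)
        population[loser] = mutate_prompt(population[winner])
    return population
```

In this mapping, the quoted temperatures would presumably live inside the two callables: 0.0 for the Inducer and Evaluator contexts behind llm_answer, and the evolvable 1.0-2.0 Redescriber temperature inside mutate_prompt.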