Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers

Authors: Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, Yujiu Yang

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We optimize prompts for both closed- and open-source LLMs including GPT-3.5 and Alpaca, on 31 datasets covering language understanding, generation tasks, as well as BIG-Bench Hard (BBH) tasks. EVOPROMPT significantly outperforms human-engineered prompts and existing methods for automatic prompt generation (e.g., up to 25% on BBH).
Researcher Affiliation | Collaboration | 1 Tsinghua University, 2 Microsoft Research, 3 Northeastern University
Pseudocode | Yes | Algorithm 1 Discrete prompt optimization: EVOPROMPT
Open Source Code | Yes | Our code is available at https://github.com/beeevita/EvoPrompt.
Open Datasets | Yes | We first conduct experiments on language understanding tasks across 7 datasets to validate our methods, including sentiment classification (SST-2 (Socher et al., 2013), MR (Pang & Lee, 2005), CR (Hu & Liu, 2004), SST-5 (Socher et al., 2013)), topic classification (AG's News (Zhang et al., 2015), TREC (Voorhees & Tice, 2000)) and subjectivity classification (Subj (Pang & Lee, 2004)). For summarization, we adopt SAMSum (Gliwa et al., 2019)... for text simplification... we employ the ASSET dataset (Alva-Manchego et al., 2020)... we apply BBH (Suzgun et al., 2022).
Dataset Splits | Yes | Specifically, abstaining from any gradients or parameters, EVOPROMPT starts from a population of prompts and iteratively generates new prompts with LLMs based on the evolutionary operators, improving the population based on the development set.
Hardware Specification | No | The paper does not explicitly describe the hardware (e.g., specific GPU models, CPU types, or cloud instance specifications) used to run the experiments.
Software Dependencies | No | The paper does not provide specific version numbers for ancillary software dependencies (e.g., Python, PyTorch, or other libraries).
Experiment Setup | Yes | The parameters for the experiments are shown in Table 11. For evolutionary algorithms implemented by GPT-3.5... we use Top-p decoding (temperature=0.5, P = 0.95). For the task implementation, we use greedy decoding and the default temperature for Alpaca. For the generation tasks implemented by GPT-3.5, the temperature is 0.0.
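The loop quoted under "Dataset Splits" (a population of prompts, LLM-driven evolutionary operators, selection on a development set) can be sketched as below. This is a paraphrase of Algorithm 1, not the authors' implementation: `toy_evolve` and `toy_score` are illustrative stand-ins for the LLM-backed crossover/mutation operator and the dev-set scorer, and the decoding constants simply restate the quoted Table 11 settings in OpenAI-style parameter names (the exact client call is an assumption).

```python
import random

# Decoding settings quoted in the Experiment Setup row (Table 11 of the
# paper), expressed as OpenAI-style request parameters (an assumption):
EVOLUTION_DECODING = {"temperature": 0.5, "top_p": 0.95}  # evolutionary ops via GPT-3.5
TASK_DECODING_GPT35 = {"temperature": 0.0}                # generation-task inference


def evoprompt(init_prompts, score, evolve, iters=10):
    """Minimal sketch of an EvoPrompt-style loop: keep a population of
    prompts, have an LLM produce a child via evolutionary operators, and
    retain the prompts that score best on the development set."""
    pop_size = len(init_prompts)
    population = list(init_prompts)
    for _ in range(iters):
        parents = random.sample(population, 2)  # parent selection
        child = evolve(parents)                 # LLM-backed crossover/mutation
        population.append(child)
        # Survivor selection: keep the top pop_size prompts by dev score.
        population.sort(key=score, reverse=True)
        population = population[:pop_size]
    return max(population, key=score)


# Toy stand-ins so the sketch runs without an LLM or a dev set:
def toy_evolve(parents):
    a, b = parents
    return a.split(":")[0] + ":" + b.split(":", 1)[1]


def toy_score(prompt):
    return len(prompt)  # placeholder for development-set accuracy
```

Because survivor selection only ever discards the lowest-scoring prompts, the best dev-set score in the population is non-decreasing across iterations.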