Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers

Authors: Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, Yujiu Yang

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We optimize prompts for both closed- and open-source LLMs including GPT-3.5 and Alpaca, on 31 datasets covering language understanding, generation tasks, as well as BIG-Bench Hard (BBH) tasks. EVOPROMPT significantly outperforms human-engineered prompts and existing methods for automatic prompt generation (e.g., up to 25% on BBH).
Researcher Affiliation | Collaboration | 1 Tsinghua University, 2 Microsoft Research, 3 Northeastern University
Pseudocode | Yes | Algorithm 1 Discrete prompt optimization: EVOPROMPT
Open Source Code | Yes | Our code is available at https://github.com/beeevita/EvoPrompt.
Open Datasets | Yes | We first conduct experiments on language understanding tasks across 7 datasets to validate our methods, including sentiment classification (SST-2 (Socher et al., 2013), MR (Pang & Lee, 2005), CR (Hu & Liu, 2004), SST-5 (Socher et al., 2013)), topic classification (AG's News (Zhang et al., 2015), TREC (Voorhees & Tice, 2000)) and subjectivity classification (Subj (Pang & Lee, 2004)). For summarization, we adopt SAMSum (Gliwa et al., 2019)... for text simplification... we employ the ASSET dataset (Alva-Manchego et al., 2020)... we apply BBH (Suzgun et al., 2022).
Dataset Splits | Yes | Specifically, abstaining from any gradients or parameters, EVOPROMPT starts from a population of prompts and iteratively generates new prompts with LLMs based on the evolutionary operators, improving the population based on the development set.
Hardware Specification | No | The paper does not explicitly describe the hardware (e.g., specific GPU models, CPU types, or cloud instance specifications) used to run the experiments.
Software Dependencies | No | The paper does not provide specific version numbers for ancillary software dependencies (e.g., Python, PyTorch, or other libraries).
Experiment Setup | Yes | The parameters for the experiments are shown in Table 11. For evolutionary algorithms implemented by GPT-3.5... we use Top-p decoding (temperature=0.5, P = 0.95). For the task implementation, we use greedy decoding and the default temperature for Alpaca. For the generation tasks implemented by GPT-3.5, the temperature is 0.0.
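The loop quoted under "Dataset Splits" (a population of prompts, LLM-driven evolutionary operators, selection on a development set) can be sketched as below. This is a paraphrase of Algorithm 1, not the authors' implementation: `toy_evolve` and `toy_score` are illustrative stand-ins for the LLM-backed crossover/mutation operator and the dev-set scorer, and the decoding constants simply restate the quoted Table 11 settings in OpenAI-style parameter names (the exact client call is an assumption).

```python
import random

# Decoding settings quoted in the Experiment Setup row (Table 11 of the
# paper), expressed as OpenAI-style request parameters (an assumption):
EVOLUTION_DECODING = {"temperature": 0.5, "top_p": 0.95}  # evolutionary ops via GPT-3.5
TASK_DECODING_GPT35 = {"temperature": 0.0}                # generation-task inference


def evoprompt(init_prompts, score, evolve, iters=10):
    """Minimal sketch of an EvoPrompt-style loop: keep a population of
    prompts, have an LLM produce a child via evolutionary operators, and
    retain the prompts that score best on the development set."""
    pop_size = len(init_prompts)
    population = list(init_prompts)
    for _ in range(iters):
        parents = random.sample(population, 2)  # parent selection
        child = evolve(parents)                 # LLM-backed crossover/mutation
        population.append(child)
        # Survivor selection: keep the top pop_size prompts by dev score.
        population.sort(key=score, reverse=True)
        population = population[:pop_size]
    return max(population, key=score)


# Toy stand-ins so the sketch runs without an LLM or a dev set:
def toy_evolve(parents):
    a, b = parents
    return a.split(":")[0] + ":" + b.split(":", 1)[1]


def toy_score(prompt):
    return len(prompt)  # placeholder for development-set accuracy
```

Because survivor selection only ever discards the lowest-scoring prompts, the best dev-set score in the population is non-decreasing across iterations.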