Large Language Models are Human-Level Prompt Engineers

Authors: Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, Jimmy Ba

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that our automatically generated instructions outperform the prior LLM baseline by a large margin and achieve better or comparable performance to the instructions generated by human annotators on 24/24 Instruction Induction tasks and 17/21 curated BIG-Bench tasks. We conduct extensive qualitative and quantitative analyses to explore the performance of APE.
Researcher Affiliation | Academia | Yongchao Zhou (1,2), Andrei Ioan Muresanu (2,3), Ziwen Han (1,2), Keiran Paster (1,2), Silviu Pitis (1,2), Harris Chan (1,2), Jimmy Ba (1,2); 1 University of Toronto, 2 Vector Institute, 3 University of Waterloo
Pseudocode | Yes | Algorithm 1: Automatic Prompt Engineer (APE) [a hedged sketch of this loop appears below the table]
Open Source Code | Yes | Our code is available at https://github.com/keirp/automatic_prompt_engineer.
Open Datasets | Yes | We assess the effectiveness of zero-shot and few-shot in-context learning on 24 instruction induction tasks proposed in Honovich et al. (2022). ... To see whether APE can be applied to more challenging tasks, we propose and curate BIG-Bench Instruction Induction (BBII), a clean and tractable subset of 21 tasks...
Dataset Splits | Yes | To reduce the computation cost, we adopt a filtering scheme where a promising candidate receives more computation resources while a low-quality candidate receives less computation. It can be achieved by using a multi-stage computation strategy on lines 2-9 of Algorithm 1. We first evaluate all candidates with a small subset of the training dataset. For the candidates with a score greater than a certain threshold, we sample and evaluate a new non-overlapping subset from the training dataset to update the moving average of the score. ... We also measure the number of tokens used to score 250 generated instructions on ten validation input-output pairs on InstructGPT (i.e., text-davinci-002). [a sketch of this filtering scheme appears below the table]
Hardware Specification | No | The paper states 'We use the text-davinci-002 via the OpenAI API' and refers to various models 'available via the OpenAI API', indicating that experiments were run by querying an external service rather than on explicitly detailed local hardware.
Software Dependencies | No | The paper mentions using the 'OpenAI API' and specific models such as 'text-davinci-002', 'GPT-3', 'InstructGPT', 'T5', 'GLM', 'InsertGPT', and 'OPT-175B'. However, it does not specify version numbers for programming languages or software libraries such as Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | For each task, we sample five input-output pairs from the training data and select the best instruction using Algorithm 1. Then, we evaluate the quality of the instruction by executing the instruction on InstructGPT. We repeat our experiments five times with different random seeds to report the mean and standard deviation. ... Thus, we choose 50 as our default sample size. [the seed-averaging protocol is sketched below the table]
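
The Pseudocode row points to Algorithm 1 (APE). As a reading aid only, here is a minimal Python sketch of the propose-score-select loop that Algorithm 1 describes; the `propose_fn` and `execute_fn` callables are hypothetical stand-ins for the paper's LLM API calls and are not part of the released code.

```python
"""Minimal sketch of APE's propose-and-select loop (Algorithm 1).

This is not the authors' implementation; it only illustrates the idea:
an LLM proposes candidate instructions from demonstrations, each candidate
is scored by how well an "execution" model follows it on the demonstration
pairs, and the highest-scoring instruction is returned.
"""
from typing import Callable, List, Tuple

Demo = Tuple[str, str]  # (input, output) pair


def ape_select(
    demos: List[Demo],
    propose_fn: Callable[[List[Demo], int], List[str]],  # demos -> candidate instructions
    execute_fn: Callable[[str, str], str],                # (instruction, input) -> model output
    num_candidates: int = 50,
) -> str:
    """Return the candidate instruction with the highest execution accuracy."""
    candidates = propose_fn(demos, num_candidates)

    def score(instruction: str) -> float:
        hits = sum(execute_fn(instruction, x).strip() == y.strip() for x, y in demos)
        return hits / len(demos)

    return max(candidates, key=score)


if __name__ == "__main__":
    # Toy stand-ins: the "proposal model" emits two fixed guesses and the
    # "execution model" uppercases its input when the instruction says so.
    demos = [("cat", "CAT"), ("dog", "DOG")]
    propose = lambda d, n: ["Repeat the input.", "Write the input in uppercase."][:n]
    execute = lambda instr, x: x.upper() if "uppercase" in instr else x
    print(ape_select(demos, propose, execute))  # -> "Write the input in uppercase."
```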
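The Dataset Splits row quotes the multi-stage filtering scheme on lines 2-9 of Algorithm 1. The sketch below is one possible reading under stated assumptions: candidates are first scored on a small training subset, only above-threshold candidates are re-scored on a fresh non-overlapping subset, and a running average of their scores is kept. `score_candidate`, `subset_size`, and `keep_threshold` are hypothetical names, not taken from the paper or its repository.

```python
"""Sketch of the multi-stage filtering described under "Dataset Splits":
spend more evaluation budget on promising candidates, less on weak ones."""
import random
from typing import Callable, Dict, List, Sequence, Tuple

Demo = Tuple[str, str]


def multi_stage_filter(
    candidates: List[str],
    train_data: Sequence[Demo],
    score_candidate: Callable[[str, Sequence[Demo]], float],  # hypothetical scorer
    subset_size: int = 10,
    num_stages: int = 3,
    keep_threshold: float = 0.5,
) -> Dict[str, float]:
    """Return a running-average score per candidate after staged filtering."""
    pool = list(train_data)
    random.shuffle(pool)
    # Pre-slice non-overlapping subsets of the training data, one per stage.
    subsets = [pool[i * subset_size:(i + 1) * subset_size] for i in range(num_stages)]

    running: Dict[str, float] = {}
    survivors = list(candidates)
    for stage, subset in enumerate(subsets):
        if not subset or not survivors:
            break
        for cand in survivors:
            new_score = score_candidate(cand, subset)
            # Moving average over the stages this candidate has been evaluated on.
            prev = running.get(cand)
            running[cand] = new_score if prev is None else (prev * stage + new_score) / (stage + 1)
        # Only candidates above the threshold receive further evaluation budget.
        survivors = [c for c in survivors if running[c] >= keep_threshold]
    return running
```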
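The Experiment Setup row describes sampling five demonstrations per task and repeating each experiment over five random seeds to report the mean and standard deviation. A small sketch of that protocol follows, with `run_ape_once` as a hypothetical stand-in for one full APE run.

```python
"""Sketch of the seed-averaged evaluation protocol quoted under "Experiment Setup"."""
import random
import statistics
from typing import Callable, List, Sequence, Tuple

Demo = Tuple[str, str]


def evaluate_task(
    train_data: Sequence[Demo],
    run_ape_once: Callable[[List[Demo]], float],  # returns test accuracy of the selected instruction
    num_demos: int = 5,
    num_seeds: int = 5,
) -> Tuple[float, float]:
    """Run APE with several random demo samples and report mean/std accuracy."""
    scores = []
    for seed in range(num_seeds):
        random.seed(seed)
        demos = random.sample(list(train_data), num_demos)
        scores.append(run_ape_once(demos))
    return statistics.mean(scores), statistics.stdev(scores)
```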