Large Language Models are Human-Level Prompt Engineers
Authors: Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, Jimmy Ba
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that our automatically generated instructions outperform the prior LLM baseline by a large margin and achieve better or comparable performance to the instructions generated by human annotators on 24/24 Instruction Induction tasks and 17/21 curated BIG-Bench tasks. We conduct extensive qualitative and quantitative analyses to explore the performance of APE. |
| Researcher Affiliation | Academia | Yongchao Zhou1,2, Andrei Ioan Muresanu2,3, Ziwen Han1,2, Keiran Paster1,2, Silviu Pitis1,2, Harris Chan1,2, Jimmy Ba1,2 1University of Toronto 2Vector Institute 3University of Waterloo |
| Pseudocode | Yes | Algorithm 1 Automatic Prompt Engineer (APE) (an illustrative sketch of the propose-and-score loop follows the table) |
| Open Source Code | Yes | Our code is available at https://github.com/keirp/automatic_prompt_engineer. |
| Open Datasets | Yes | We assess the effectiveness of zero-shot and few-shot in-context learning on 24 instruction induction tasks proposed in Honovich et al. (2022). ... To see whether APE can be applied to more challenging tasks, we propose and curate BIG-Bench Instruction Induction (BBII), a clean and tractable subset of 21 tasks... |
| Dataset Splits | Yes | To reduce the computation cost, we adopt a filtering scheme where a promising candidate receives more computation resources while a low-quality candidate receives less computation. It can be achieved by using a multi-stage computation strategy on lines 2-9 of Algorithm 1. We first evaluate all candidates with a small subset of the training dataset. For the candidates with a score greater than a certain threshold, we sample and evaluate a new non-overlapping subset from the training dataset to update the moving average of the score. ... We also measure the number of tokens used to score 250 generated instructions on ten validation input-output pairs on InstructGPT (i.e., text-davinci-002). (This filtering scheme is sketched after the table.) |
| Hardware Specification | No | The paper states 'We use the text-davinci-002 via the OpenAI API' and refers to various models 'available via the OpenAI API', indicating that experiments were run by querying an external service rather than on locally specified hardware. |
| Software Dependencies | No | The paper mentions using the 'OpenAI API' and specific models like 'text-davinci-002', 'GPT-3', 'InstructGPT', 'T5', 'GLM', 'Insert GPT', and 'OPT-175B'. However, it does not specify version numbers for programming languages or software libraries like Python, PyTorch, or TensorFlow. |
| Experiment Setup | Yes | For each task, we sample five input-output pairs from the training data and select the best instruction using Algorithm 1. Then, we evaluate the quality of the instruction by executing the instruction on InstructGPT. We repeat our experiments five times with different random seeds to report the mean and standard deviation. ... Thus, we choose 50 as our default sample size. (This protocol is sketched after the table.) |
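
The Pseudocode row points to Algorithm 1 (APE). The following is a minimal Python sketch of the propose-then-score loop that algorithm describes, not the paper's released implementation: `propose_fn` and `score_fn` are hypothetical stand-ins for calls to the proposal and scoring models (e.g., text-davinci-002 via the OpenAI API), the proposal prompt wording is illustrative, and candidates are scored on the same demonstrations for brevity.

```python
def ape_select_instruction(demos, propose_fn, score_fn, n_candidates=50):
    """Propose instruction candidates from input-output demonstrations,
    score each by execution accuracy, and return the best one."""
    demo_block = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in demos)
    proposal_prompt = (
        "I gave a friend an instruction. Based on the instruction they produced "
        f"the following input-output pairs:\n\n{demo_block}\n\nThe instruction was:"
    )
    # Sample candidate instructions from the proposal model.
    candidates = [propose_fn(proposal_prompt) for _ in range(n_candidates)]

    def execution_accuracy(instruction):
        # Fraction of demonstrations the scoring model answers correctly
        # when prompted with the candidate instruction.
        hits = 0
        for x, y in demos:
            pred = score_fn(f"Instruction: {instruction}\nInput: {x}\nOutput:")
            hits += int(pred.strip() == y.strip())
        return hits / len(demos)

    return max(candidates, key=execution_accuracy)
```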
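
The Dataset Splits row quotes a multi-stage filtering scheme: all candidates are first scored on a small training subset, and only those above a threshold are re-scored on fresh, non-overlapping subsets, updating a moving average. The sketch below illustrates that idea under assumptions: `score_fn(candidate, pairs)` is a hypothetical execution-accuracy scorer, and the subset size, threshold, and number of stages are illustrative values rather than the paper's settings.

```python
import random

def multistage_score(candidates, train_pairs, score_fn,
                     subset_size=10, threshold=0.5, n_stages=3):
    """Score candidates in stages so promising ones receive more compute."""
    pool = list(train_pairs)
    random.shuffle(pool)
    stats = {c: (0.0, 0) for c in candidates}  # (score sum, subsets seen)

    for stage in range(n_stages):
        # Each stage draws a fresh, non-overlapping slice of the training data.
        subset = pool[stage * subset_size:(stage + 1) * subset_size]
        if not subset:
            break
        for c in stats:
            total, n = stats[c]
            stats[c] = (total + score_fn(c, subset), n + 1)
        # Only candidates whose moving-average score clears the threshold
        # survive to the next stage (keep everyone if none do).
        survivors = {c: v for c, v in stats.items() if v[0] / v[1] >= threshold}
        if survivors:
            stats = survivors

    return max(stats, key=lambda c: stats[c][0] / stats[c][1])
```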
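
The Experiment Setup row describes the evaluation protocol: five sampled demonstrations per task, instruction selection via Algorithm 1, and five repetitions with different random seeds, reporting mean and standard deviation. A minimal sketch of that loop, assuming hypothetical `select_instruction` and `execution_accuracy` helpers (e.g., the two sketches above):

```python
import random
import statistics

def evaluate_over_seeds(train_pairs, test_pairs, select_instruction,
                        execution_accuracy, n_demos=5, n_seeds=5):
    """For each seed: sample demonstrations, select an instruction, and
    measure test accuracy; report mean and standard deviation."""
    scores = []
    for seed in range(n_seeds):
        random.seed(seed)
        demos = random.sample(train_pairs, n_demos)
        instruction = select_instruction(demos)
        scores.append(execution_accuracy(instruction, test_pairs))
    return statistics.mean(scores), statistics.stdev(scores)
```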