Large Language Models as Optimizers

Authors: Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, Xinyun Chen

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We first showcase OPRO on linear regression and traveling salesman problems, then move on to our main application in prompt optimization, where the goal is to find instructions that maximize the task accuracy. With a variety of LLMs, we demonstrate that the best prompts optimized by OPRO outperform human-designed prompts by up to 8% on GSM8K, and by up to 50% on Big-Bench Hard tasks. Code at https://github.com/google-deepmind/opro.
Researcher Affiliation | Industry | Google DeepMind
Pseudocode | No | The paper describes the OPRO framework conceptually and with figures, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code at https://github.com/google-deepmind/opro.
Open Datasets | Yes | We optimize prompts on GSM8K (Cobbe et al., 2021) and Big-Bench Hard (Suzgun et al., 2022), which are reasoning benchmarks where prompting techniques have achieved remarkable performance breakthroughs (Wei et al., 2022; Kojima et al., 2022; Suzgun et al., 2022). To examine the transferability of the optimized instructions, we also evaluate the instructions optimized for GSM8K on two other mathematical reasoning datasets, i.e., MultiArith (Roy & Roth, 2016) and AQuA (Ling et al., 2017).
Dataset Splits | Yes | For GSM8K, we randomly sample 3.5% examples from the training set, and the same subset is used throughout optimization. This balances evaluation cost with generalization performance. After the optimization finishes, we evaluate the found instructions on the entire GSM8K test set.
Hardware Specification | No | The paper mentions various LLMs (e.g., PaLM 2-L, text-bison, gpt-3.5-turbo, gpt-4) but does not specify the underlying hardware (e.g., GPU models, CPU types, or cloud compute specifications) used for running the experiments.
Software Dependencies | No | The paper states which LLM APIs were used (e.g., PaLM 2-L, text-bison, gpt-3.5-turbo, gpt-4), but it does not list software dependencies such as libraries, frameworks, or version numbers needed to reproduce the experiments.
Experiment Setup | Yes | We use temperature 0 (greedy decoding) to evaluate generated instructions. We set the default temperature to 1.0 for optimizer LLMs to generate diverse instructions. In each step, we prompt the optimizer LLM with the meta-prompt 8 times to generate 8 instructions. Our meta-prompt contains the best 20 instructions so far and 3 randomly picked exemplars from the training set. The 3 exemplars are independently sampled at each step to reduce the risk of overfitting on the given exemplars.
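
The Experiment Setup row is concrete enough to sketch one OPRO optimization step in code, which may be useful since the paper itself contains no pseudocode (see the Pseudocode row above). The Python sketch below is an illustration of that setup, not the authors' implementation (their code is at https://github.com/google-deepmind/opro); call_optimizer_llm, call_scorer_llm, and is_correct are hypothetical stand-ins for the LLM APIs and answer checking, and the meta-prompt wording is paraphrased rather than the paper's exact template.

import random

# Hypothetical stand-ins; plug in real APIs (e.g., PaLM 2-L, text-bison, gpt-4)
# and task-specific answer checking to run this sketch.
def call_optimizer_llm(prompt, temperature=1.0):
    raise NotImplementedError("optimizer LLM API call")

def call_scorer_llm(prompt, temperature=0.0):
    raise NotImplementedError("scorer LLM API call")

def is_correct(prediction, target):
    raise NotImplementedError("task-specific answer check, e.g. GSM8K numeric match")

def build_meta_prompt(scored_instructions, exemplars):
    # Keep the best 20 instructions so far, listed in ascending score order.
    top_20 = sorted(scored_instructions, key=lambda pair: pair[1])[-20:]
    parts = ["Below are previous instructions with their training accuracies:"]
    parts += [f"text: {ins}\nscore: {score:.1f}" for ins, score in top_20]
    parts.append("Here are example problems from the task:")
    parts += [f"Q: {q}\nA: {a}" for q, a in exemplars]
    parts.append("Generate a new instruction that achieves a higher score.")
    return "\n\n".join(parts)

def evaluate_instruction(instruction, eval_set):
    # Greedy decoding (temperature 0) on the fixed evaluation subset,
    # e.g. the 3.5% sample of the GSM8K training set.
    num_correct = sum(
        is_correct(call_scorer_llm(f"{instruction}\n\nQ: {q}\nA:", temperature=0.0), a)
        for q, a in eval_set
    )
    return 100.0 * num_correct / len(eval_set)

def opro_step(scored_instructions, train_set, eval_set):
    # 3 exemplars re-sampled independently at every step to limit overfitting.
    exemplars = random.sample(train_set, 3)
    meta_prompt = build_meta_prompt(scored_instructions, exemplars)
    # 8 optimizer calls per step at temperature 1.0 for diverse candidates.
    for _ in range(8):
        candidate = call_optimizer_llm(meta_prompt, temperature=1.0)
        scored_instructions.append((candidate, evaluate_instruction(candidate, eval_set)))
    return scored_instructions

Listing the retained instructions in ascending score order mirrors the paper's convention of placing higher-scoring instructions nearer the end of the meta-prompt.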