Large Language Models as Optimizers
Authors: Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, Xinyun Chen
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first showcase OPRO on linear regression and traveling salesman problems, then move on to our main application in prompt optimization, where the goal is to find instructions that maximize the task accuracy. With a variety of LLMs, we demonstrate that the best prompts optimized by OPRO outperform human-designed prompts by up to 8% on GSM8K, and by up to 50% on Big-Bench Hard tasks. Code at https://github.com/google-deepmind/opro. |
| Researcher Affiliation | Industry | Google DeepMind |
| Pseudocode | No | The paper describes the OPRO framework conceptually and with figures, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code at https://github.com/google-deepmind/opro. |
| Open Datasets | Yes | We optimize prompts on GSM8K (Cobbe et al., 2021) and Big-Bench Hard (Suzgun et al., 2022), which are reasoning benchmarks where prompting techniques have achieved remarkable performance breakthrough (Wei et al., 2022; Kojima et al., 2022; Suzgun et al., 2022). To examine the transferability of the optimized instructions, we also evaluate the instructions optimized for GSM8K on two other mathematical reasoning datasets, i.e., MultiArith (Roy & Roth, 2016) and AQuA (Ling et al., 2017). |
| Dataset Splits | Yes | For GSM8K, we randomly sample 3.5% examples from the training set, and the same subset is used throughout optimization. This balances evaluation cost with generalization performance. After the optimization finishes, we evaluate the found instructions on the entire GSM8K test set. |
| Hardware Specification | No | The paper mentions various LLMs (e.g., PaLM 2-L, text-bison, gpt-3.5-turbo, gpt-4) but does not specify the underlying hardware (e.g., GPU models, CPU types, or cloud compute specifications) used for running the experiments. |
| Software Dependencies | No | The paper states which LLM APIs were used (e.g., PaLM 2-L, text-bison, gpt-3.5-turbo, gpt-4), but it does not list software dependencies such as library names, versions, or environment specifications. |
| Experiment Setup | Yes | We use temperature 0 (greedy decoding) to evaluate generated instructions. We set the default temperature to 1.0 for optimizer LLMs to generate diverse instructions. In each step, we prompt the optimizer LLM with the meta-prompt 8 times to generate 8 instructions. Our meta-prompt contains the best 20 instructions so far and 3 randomly picked exemplars from the training set. The 3 exemplars are independently sampled at each step to reduce the risk of overfitting on the given exemplars. |
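
The Dataset Splits and Experiment Setup rows above pin down enough of the procedure to sketch the prompt-optimization loop. Below is a minimal Python sketch under stated assumptions, not the authors' released implementation (that is at https://github.com/google-deepmind/opro). The wrappers `call_optimizer_llm` and `call_scorer_llm`, the step budget `NUM_STEPS`, and the seed instruction are hypothetical placeholders; the temperatures, 8 instructions per step, top-20 instructions in the meta-prompt, 3 exemplars resampled each step, and the fixed 3.5% GSM8K training subset are taken from the quoted setup.

```python
# Minimal sketch of an OPRO-style prompt-optimization loop, assuming hypothetical
# LLM wrappers. The real implementation is at github.com/google-deepmind/opro.
import random

NUM_STEPS = 200                # assumption; the paper's step budget varies by task
INSTRUCTIONS_PER_STEP = 8      # 8 instructions generated per step
TOP_K_IN_META_PROMPT = 20      # best 20 instructions so far kept in the meta-prompt
EXEMPLARS_PER_STEP = 3         # 3 training exemplars, independently resampled each step
TRAIN_FRACTION = 0.035         # fixed 3.5% sample of the GSM8K training set


def call_optimizer_llm(meta_prompt: str, temperature: float = 1.0) -> str:
    """Hypothetical wrapper around the optimizer LLM (temperature 1.0 for diversity)."""
    raise NotImplementedError("plug in your LLM API here")


def call_scorer_llm(instruction: str, question: str, temperature: float = 0.0) -> str:
    """Hypothetical wrapper around the scorer LLM (greedy decoding, temperature 0)."""
    raise NotImplementedError("plug in your LLM API here")


def evaluate(instruction: str, subset: list[dict]) -> float:
    """Training accuracy of an instruction on the fixed training subset."""
    correct = sum(
        call_scorer_llm(instruction, ex["question"]).strip() == ex["answer"]
        for ex in subset
    )
    return correct / len(subset)


def build_meta_prompt(scored: list[tuple[str, float]], exemplars: list[dict]) -> str:
    """Meta-prompt: the best instructions so far plus a few task exemplars."""
    best = sorted(scored, key=lambda x: x[1])[-TOP_K_IN_META_PROMPT:]
    trajectory = "\n".join(f"text: {ins}\nscore: {acc:.0%}" for ins, acc in best)
    examples = "\n".join(f"Q: {ex['question']}\nA: {ex['answer']}" for ex in exemplars)
    return (
        "Here are previous instructions with their training accuracies:\n"
        f"{trajectory}\n\nHere are example problems:\n{examples}\n\n"
        "Write a new instruction that achieves a higher accuracy."
    )


def opro(train_set: list[dict], seed: int = 0) -> str:
    rng = random.Random(seed)
    # Fixed evaluation subset: 3.5% of the training set, reused throughout the run.
    subset = rng.sample(train_set, max(1, int(TRAIN_FRACTION * len(train_set))))

    seed_instruction = "Let's solve the problem."  # assumption: a simple starting instruction
    scored = [(seed_instruction, evaluate(seed_instruction, subset))]
    for _ in range(NUM_STEPS):
        exemplars = rng.sample(train_set, EXEMPLARS_PER_STEP)  # resampled every step
        meta_prompt = build_meta_prompt(scored, exemplars)
        for _ in range(INSTRUCTIONS_PER_STEP):
            candidate = call_optimizer_llm(meta_prompt, temperature=1.0)
            scored.append((candidate, evaluate(candidate, subset)))

    # The best instruction found is then evaluated on the full GSM8K test set.
    return max(scored, key=lambda x: x[1])[0]
```

In this sketch, ordering the meta-prompt trajectory by ascending score and resampling the 3 exemplars each step mirror the design choices quoted in the table; everything else (prompt wording, seed instruction, step budget) is an illustrative assumption.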