Large Language Models as Optimizers
Authors: Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, Xinyun Chen
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first showcase OPRO on linear regression and traveling salesman problems, then move on to our main application in prompt optimization, where the goal is to find instructions that maximize the task accuracy. With a variety of LLMs, we demonstrate that the best prompts optimized by OPRO outperform human-designed prompts by up to 8% on GSM8K, and by up to 50% on Big-Bench Hard tasks. Code at https://github.com/google-deepmind/opro. |
| Researcher Affiliation | Industry | Google DeepMind |
| Pseudocode | No | The paper describes the OPRO framework conceptually and with figures, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code at https://github.com/google-deepmind/opro. |
| Open Datasets | Yes | We optimize prompts on GSM8K (Cobbe et al., 2021) and Big-Bench Hard (Suzgun et al., 2022), which are reasoning benchmarks where prompting techniques have achieved remarkable performance breakthrough (Wei et al., 2022; Kojima et al., 2022; Suzgun et al., 2022). To examine the transferability of the optimized instructions, we also evaluate the instructions optimized for GSM8K on two other mathematical reasoning datasets, i.e., MultiArith (Roy & Roth, 2016) and AQuA (Ling et al., 2017). |
| Dataset Splits | Yes | For GSM8K, we randomly sample 3.5% examples from the training set, and the same subset is used throughout optimization. This balances evaluation cost with generalization performance. After the optimization finishes, we evaluate the found instructions on the entire GSM8K test set. |
| Hardware Specification | No | The paper mentions various LLMs (e.g., PaLM 2-L, text-bison, gpt-3.5-turbo, gpt-4) but does not specify the underlying hardware (e.g., GPU models, CPU types, or cloud compute specifications) used for running the experiments. |
| Software Dependencies | No | The paper states which LLM APIs were used (e.g., PaLM 2-L, text-bison, gpt-3.5-turbo, gpt-4), but it does not list software dependencies such as library names, versions, or environment specifications. |
| Experiment Setup | Yes | We use temperature 0 (greedy decoding) to evaluate generated instructions. We set the default temperature to 1.0 for optimizer LLMs to generate diverse instructions. In each step, we prompt the optimizer LLM with the meta-prompt 8 times to generate 8 instructions. Our meta-prompt contains the best 20 instructions so far and 3 randomly picked exemplars from the training set. The 3 exemplars are independently sampled at each step to reduce the risk of overfitting on the given exemplars. |
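
The Dataset Splits and Experiment Setup rows above pin down enough of the procedure to sketch the prompt-optimization loop. Below is a minimal Python sketch under stated assumptions, not the authors' released implementation (that is at https://github.com/google-deepmind/opro). The wrappers `call_optimizer_llm` and `call_scorer_llm`, the step budget `NUM_STEPS`, and the seed instruction are hypothetical placeholders; the temperatures, 8 instructions per step, top-20 instructions in the meta-prompt, 3 exemplars resampled each step, and the fixed 3.5% GSM8K training subset are taken from the quoted setup.

```python
# Minimal sketch of an OPRO-style prompt-optimization loop, assuming hypothetical
# LLM wrappers. The real implementation is at github.com/google-deepmind/opro.
import random

NUM_STEPS = 200                # assumption; the paper's step budget varies by task
INSTRUCTIONS_PER_STEP = 8      # 8 instructions generated per step
TOP_K_IN_META_PROMPT = 20      # best 20 instructions so far kept in the meta-prompt
EXEMPLARS_PER_STEP = 3         # 3 training exemplars, independently resampled each step
TRAIN_FRACTION = 0.035         # fixed 3.5% sample of the GSM8K training set


def call_optimizer_llm(meta_prompt: str, temperature: float = 1.0) -> str:
    """Hypothetical wrapper around the optimizer LLM (temperature 1.0 for diversity)."""
    raise NotImplementedError("plug in your LLM API here")


def call_scorer_llm(instruction: str, question: str, temperature: float = 0.0) -> str:
    """Hypothetical wrapper around the scorer LLM (greedy decoding, temperature 0)."""
    raise NotImplementedError("plug in your LLM API here")


def evaluate(instruction: str, subset: list[dict]) -> float:
    """Training accuracy of an instruction on the fixed training subset."""
    correct = sum(
        call_scorer_llm(instruction, ex["question"]).strip() == ex["answer"]
        for ex in subset
    )
    return correct / len(subset)


def build_meta_prompt(scored: list[tuple[str, float]], exemplars: list[dict]) -> str:
    """Meta-prompt: the best instructions so far plus a few task exemplars."""
    best = sorted(scored, key=lambda x: x[1])[-TOP_K_IN_META_PROMPT:]
    trajectory = "\n".join(f"text: {ins}\nscore: {acc:.0%}" for ins, acc in best)
    examples = "\n".join(f"Q: {ex['question']}\nA: {ex['answer']}" for ex in exemplars)
    return (
        "Here are previous instructions with their training accuracies:\n"
        f"{trajectory}\n\nHere are example problems:\n{examples}\n\n"
        "Write a new instruction that achieves a higher accuracy."
    )


def opro(train_set: list[dict], seed: int = 0) -> str:
    rng = random.Random(seed)
    # Fixed evaluation subset: 3.5% of the training set, reused throughout the run.
    subset = rng.sample(train_set, max(1, int(TRAIN_FRACTION * len(train_set))))

    seed_instruction = "Let's solve the problem."  # assumption: a simple starting instruction
    scored = [(seed_instruction, evaluate(seed_instruction, subset))]
    for _ in range(NUM_STEPS):
        exemplars = rng.sample(train_set, EXEMPLARS_PER_STEP)  # resampled every step
        meta_prompt = build_meta_prompt(scored, exemplars)
        for _ in range(INSTRUCTIONS_PER_STEP):
            candidate = call_optimizer_llm(meta_prompt, temperature=1.0)
            scored.append((candidate, evaluate(candidate, subset)))

    # The best instruction found is then evaluated on the full GSM8K test set.
    return max(scored, key=lambda x: x[1])[0]
```

In this sketch, ordering the meta-prompt trajectory by ascending score and resampling the 3 exemplars each step mirror the design choices quoted in the table; everything else (prompt wording, seed instruction, step budget) is an illustrative assumption.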