Prompt Optimization with EASE? Efficient Ordering-aware Automated Selection of Exemplars

Authors: Zhaoxuan Wu, Xiaoqiang Lin, Zhongxiang Dai, Wenyang Hu, Yao Shu, See-Kiong Ng, Patrick Jaillet, Bryan Kian Hsiang Low

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive empirical evaluations (including novel tasks), we demonstrate the superiority of EASE over existing methods
Researcher Affiliation | Academia | Institute of Data Science, National University of Singapore, Republic of Singapore; Singapore-MIT Alliance for Research and Technology, Republic of Singapore; Dept. of Computer Science, National University of Singapore, Republic of Singapore; School of Data Science, The Chinese University of Hong Kong, Shenzhen; LIDS and EECS, Massachusetts Institute of Technology, USA; Guangdong Lab of AI and Digital Economy (SZ)
Pseudocode | Yes | Algorithm 1: EASE
Open Source Code | Yes | Our code is available at https://github.com/ZhaoxuanWu/EASE-Prompt-Optimization.
Open Datasets | Yes | We test on tasks that contain more than 100 training examples for selection. We compare EASE with the following comprehensive suite of baselines. They include subset selection methods with (a) a determinantal point process (DPP) metric adapted from [48], and (b) maximum mean discrepancy (MMD) [39] and (c) optimal transport (OT) [42] metrics. We also adapt retrieval-based methods to our setting using a new retrieve-then-sample strategy based on the validation set (details in App. B.2 and Sec. 4.5), specifically with the classical (d) Cosine similarity and (e) BM25 [37] retrievers. We also compare with existing exemplar selection baselines using (f) an active selection policy learned using reinforcement learning (Active) [50], and (g) an exemplar influence metric (Inf) [31]. Additionally, we propose two more new strong baselines: (h) Evo, which mutates exemplars through evolutionary strategies, and (i) Best-of-N, which explores the exemplar space uniformly until the query budget is exhausted. Evo is similar to PhaseEvo proposed by Cui et al. [8]. Best-of-N shares similarity with DSPy's Bootstrap Few-Shot with Random Search [17]; a minimal sketch of Best-of-N appears after this table. More implementation details are found in App. B.2. The number of exemplars in the in-context prompt is set to k = 5. The black-box query budget is 165 evaluations, following Lin et al. [21]. Since the effectiveness of the optimization is directly reflected by the value of the objective function in (1), we report the validation accuracy in the following experiments unless otherwise specified. The test accuracy tables are presented in App. C.15.
Dataset Splits | Yes | where $s(\cdot,\cdot)$ is a score function for the output against the ground truth, $D_V$ is the held-out validation set, and $s_V(E) = \frac{1}{|D_V|} \sum_{(x,y) \in D_V} s(f(E, x), y)$. The validation set only contains 20 data points (limited by the fact that querying the GPT-3.5 API is expensive). A sketch of this objective appears after this table.
Hardware Specification | Yes | All experiments are conducted on a server with Intel(R) Xeon(R) CPU and NVIDIA H100 GPUs.
Software Dependencies | No | The paper states 'we use gpt-3.5-turbo-1106 as the target black-box model, and MPNet as the embedding model' but does not specify version numbers for general software libraries or tools used in the experiments.
Experiment Setup | Yes | The number of exemplars in the in-context prompt is set to k = 5. The black-box query budget is 165 evaluations, following Lin et al. [21]. We use a sampling size of q = 50,000 exemplar permutations per iteration after the optimal transport (OT) technique is introduced. One important hyperparameter of the NeuralUCB algorithm utilized in the paper is ν (see (3)), which controls the degree of exploration performed during the optimization process. Sketches of the permutation sampling and the ν-weighted selection appear after this table.
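
The objective quoted under Dataset Splits is simply the average score of the black-box model's outputs over the held-out validation set. Below is a minimal sketch of that computation, assuming hypothetical `llm` and `score` callables that stand in for the black-box model f and the score function s (these names are illustrative, not from the released code):

```python
from typing import Callable, List, Tuple

def validation_score(
    exemplars: List[Tuple[str, str]],        # the ordered exemplar sequence E
    validation_set: List[Tuple[str, str]],   # D_V, here only 20 (input, label) pairs
    llm: Callable[[List[Tuple[str, str]], str], str],  # f(E, x): one black-box LLM call
    score: Callable[[str, str], float],      # s(output, ground truth), e.g. exact match
) -> float:
    """Compute s_V(E) = (1/|D_V|) * sum over (x, y) in D_V of s(f(E, x), y)."""
    total = 0.0
    for x, y in validation_set:
        total += score(llm(exemplars, x), y)
    return total / len(validation_set)
```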
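
The Best-of-N baseline described under Open Datasets spends the 165-query budget on uniformly sampled exemplar sequences and keeps the best one. A minimal sketch, assuming a hypothetical `evaluate` callable that wraps one validation-set evaluation such as `validation_score` above:

```python
import random
from typing import Callable, List, Optional, Tuple

def best_of_n(
    pool: List[Tuple[str, str]],  # training examples to select exemplars from
    evaluate: Callable[[Tuple[Tuple[str, str], ...]], float],  # one black-box evaluation on D_V
    k: int = 5,        # exemplars per in-context prompt
    budget: int = 165, # black-box query budget
    seed: int = 0,
) -> Tuple[Optional[Tuple[Tuple[str, str], ...]], float]:
    """Uniformly explore ordered k-subsets of exemplars until the budget is exhausted."""
    rng = random.Random(seed)
    best_exemplars, best_score = None, float("-inf")
    for _ in range(budget):
        candidate = tuple(rng.sample(pool, k))  # ordered sample, so ordering matters
        s = evaluate(candidate)
        if s > best_score:
            best_exemplars, best_score = candidate, s
    return best_exemplars, best_score
```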
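
The Experiment Setup row mentions sampling q = 50,000 candidate exemplar permutations per iteration and scoring them with an acquisition whose exploration strength is ν. The following is only a schematic sketch of that UCB-style selection, assuming the surrogate exposes mean and uncertainty estimates; it is not the paper's NeuralUCB implementation:

```python
import random
from typing import Callable, List, Sequence, Tuple

Exemplars = Tuple[Tuple[str, str], ...]

def sample_permutations(pool: Sequence[Tuple[str, str]], k: int = 5,
                        q: int = 50_000, seed: int = 0) -> List[Exemplars]:
    """Uniformly sample q ordered k-subsets (permutations) of the exemplar pool."""
    rng = random.Random(seed)
    return [tuple(rng.sample(list(pool), k)) for _ in range(q)]

def ucb_select(candidates: List[Exemplars],
               predict_mean: Callable[[Exemplars], float],
               predict_uncertainty: Callable[[Exemplars], float],
               nu: float = 1.0) -> Exemplars:
    """Pick argmax over E of mu(E) + nu * sigma(E); larger nu means more exploration."""
    return max(candidates, key=lambda e: predict_mean(e) + nu * predict_uncertainty(e))
```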