One Prompt is not Enough: Automated Construction of a Mixture-of-Expert Prompts

Authors: Ruochen Wang, Sohyun An, Minhao Cheng, Tianyi Zhou, Sung Ju Hwang, Cho-Jui Hsieh

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We scrutinize the proposed Mixture-of-Prompts (MoP) through an extensive empirical study. Our key findings can be summarized as follows: (1) Clustering demos in the embedding space can effectively find semantically similar clusters that help allocate test samples accurately to the corresponding region and the optimal expert. (2) More experts are not necessarily better: there exists an optimal number of partitions for the problem space. (3) The optimal instruction for each demo cluster is often distinct, necessitating the joint search of demos and instructions. We further validate the strength of MoP across three major prompt optimization benchmarks: Instruction-Induction (Honovich et al., 2022), Super Natural Instructions (Wang et al., 2022b), and BIG-Bench Hard (Suzgun et al., 2022). (A routing sketch follows the table.)
Researcher Affiliation | Academia | 1 University of California, Los Angeles; 2 Korea Advanced Institute of Science & Technology; 3 Penn State University; 4 University of Maryland, College Park.
Pseudocode | Yes | Appendix B, "Algorithms for MoP": Algorithm 1 (Building MoP).
Open Source Code | No | The paper does not include an unambiguous statement that the authors are releasing their source code, nor does it provide a direct link to a code repository for the methodology described.
Open Datasets | Yes | We further validate the strength of MoP across three major prompt optimization benchmarks: Instruction-Induction (Honovich et al., 2022), Super Natural Instructions (Wang et al., 2022b), and BIG-Bench Hard (Suzgun et al., 2022).
Dataset Splits | Yes | "Concretely, given a set of demos sampled from a task distribution D, analogous to the training data in supervised learning, prompt optimization aims at finding an instruction (demo-free) that minimizes the empirical risk (or maximizes a score): ... APE ... picks the best one according to their rank on a held-out validation set (partitioned from the training demos as well)." and "To select the best instruction from a set of proposals, existing prompt optimization algorithms commonly rank their performance on a held-out validation set, sampled from the same distribution as the training demos." (A selection sketch follows the table.)
Hardware Specification | Yes | The results are measured on the Instruction-Induction benchmark, using the same hardware (1× 48 GB A6000) and API model (gpt-3.5-turbo-instruct).
Software Dependencies | No | The paper mentions specific OpenAI models (GPT-3.5-Turbo-Instruct, text-embedding-ada-002) but does not list ancillary software dependencies with specific version numbers (e.g., Python libraries, frameworks like PyTorch or TensorFlow).
Experiment Setup | Yes | Following previous automatic prompting works (Zhou et al., 2022), we set the temperature to 0.9 when generating instructions using LLMs to encourage diversity, and to 0.0 when evaluating with LLMs. Furthermore, for a fair comparison, we set the same budget for all methods. Regarding demo assignments in MoP, we use the default hyperparameters consistently across all experiments. ... Regarding hyperparameters, the α value in Equation (10) is set to the default value of 0.02 and remains the same across all experiments. (A config sketch follows the table.)
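
The Research Type row above summarizes MoP's core mechanism: demos are clustered in an embedding space, and each test query is routed to the expert whose region it falls in. The following is a minimal sketch of that routing idea only, not the authors' released code; it uses scikit-learn k-means as a stand-in for the paper's demo-assignment procedure, and the embedding dimensionality and expert count are illustrative assumptions.

```python
# Hedged sketch: cluster demo embeddings into expert regions, then route a
# test query to the expert with the nearest cluster centroid. K-means is a
# stand-in assumption for the paper's actual demo-assignment procedure.
import numpy as np
from sklearn.cluster import KMeans


def build_experts(demo_embeddings: np.ndarray, n_experts: int = 4) -> KMeans:
    """Partition demos into regions; each region later receives its own instruction."""
    km = KMeans(n_clusters=n_experts, n_init=10, random_state=0)
    km.fit(demo_embeddings)
    return km  # km.cluster_centers_ define the experts' regions


def route_to_expert(km: KMeans, query_embedding: np.ndarray) -> int:
    """Assign a test query to the expert whose centroid is closest in embedding space."""
    dists = np.linalg.norm(km.cluster_centers_ - query_embedding, axis=1)
    return int(np.argmin(dists))
```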
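
The Dataset Splits row quotes the paper's description of selecting instructions by their rank on a validation set partitioned from the training demos. Below is a hedged sketch of that selection step under stated assumptions: `score_fn` is a hypothetical callable (not from the paper) that queries the task LLM with a candidate instruction and returns its accuracy on the given demos.

```python
# Hedged sketch of validation-based instruction selection: split the demos,
# score each candidate instruction on the held-out portion, keep the best.
# `score_fn` is a hypothetical scoring callable, assumed here for illustration.
import random
from typing import Callable, List, Tuple

Demo = Tuple[str, str]  # (input, output) pair


def split_demos(demos: List[Demo], val_frac: float = 0.2, seed: int = 0):
    """Partition demos into train/validation, mirroring a supervised-learning split."""
    rng = random.Random(seed)
    shuffled = demos[:]
    rng.shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_frac))
    return shuffled[n_val:], shuffled[:n_val]


def select_instruction(candidates: List[str],
                       val_demos: List[Demo],
                       score_fn: Callable[[str, List[Demo]], float]) -> str:
    """Rank candidate instructions by their empirical score on the validation demos."""
    return max(candidates, key=lambda instruction: score_fn(instruction, val_demos))
```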
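
Finally, the Experiment Setup row reports the decoding and search settings. The snippet below only restates those reported values in config form; the key names are assumptions for illustration, and only the values (generation temperature 0.9, evaluation temperature 0.0, α = 0.02) come from the quoted text.

```python
# Illustrative config sketch; key names are assumed, values are as quoted above.
EXPERIMENT_CONFIG = {
    "instruction_generation": {"temperature": 0.9},  # encourage diverse instruction proposals
    "evaluation": {"temperature": 0.0},              # deterministic scoring of instructions
    "alpha": 0.02,                                   # Equation (10) hyperparameter, fixed across experiments
}
```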