One Prompt is not Enough: Automated Construction of a Mixture-of-Expert Prompts

Authors: Ruochen Wang, Sohyun An, Minhao Cheng, Tianyi Zhou, Sung Ju Hwang, Cho-Jui Hsieh

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We scrutinize the proposed Mixture-of-Prompts (MoP) through an extensive empirical study. Our key findings can be summarized as follows: (1) Clustering demos in the embedding space can effectively find semantically similar clusters that help allocate test samples accurately to the corresponding region and the optimal expert. (2) More experts are not necessarily better: there exists an optimal number of partitions for the problem space. (3) The optimal instruction for each demo cluster is often distinct, necessitating the joint search of demos and instructions. We further validate the strength of MoP across three major prompt optimization benchmarks: Instruction-Induction (Honovich et al., 2022), Super Natural Instructions (Wang et al., 2022b), and BIG-Bench Hard (Suzgun et al., 2022). (A routing sketch follows the table.)
Researcher Affiliation | Academia | 1 University of California, Los Angeles; 2 Korea Advanced Institute of Science & Technology; 3 Penn State University; 4 University of Maryland, College Park.
Pseudocode | Yes | Appendix B, "Algorithms for MoP": Algorithm 1 (Building MoP).
Open Source Code | No | The paper does not include an unambiguous statement that the authors are releasing their source code, nor does it provide a direct link to a code repository for the methodology described.
Open Datasets | Yes | We further validate the strength of MoP across three major prompt optimization benchmarks: Instruction-Induction (Honovich et al., 2022), Super Natural Instructions (Wang et al., 2022b), and BIG-Bench Hard (Suzgun et al., 2022).
Dataset Splits | Yes | "Concretely, given a set of demos sampled from a task distribution D, analogous to the training data in supervised learning, prompt optimization aims at finding an instruction (demo-free) that minimizes the empirical risk (or maximizes a score): ... APE ... picks the best one according to their rank on a held-out validation set (partitioned from the training demos as well)." and "To select the best instruction from a set of proposals, existing prompt optimization algorithms commonly rank their performance on a held-out validation set, sampled from the same distribution as the training demos." (A selection sketch follows the table.)
Hardware Specification | Yes | The results are measured on the Instruction-Induction benchmark, using the same hardware (1× 48 GB A6000) and API model (gpt-3.5-turbo-instruct).
Software Dependencies | No | The paper mentions specific OpenAI models (GPT-3.5-Turbo-Instruct, text-embedding-ada-002) but does not list ancillary software dependencies with specific version numbers (e.g., Python libraries, frameworks like PyTorch or TensorFlow).
Experiment Setup | Yes | Following previous automatic prompting works (Zhou et al., 2022), we set the temperature to 0.9 when generating instructions using LLMs to encourage diversity, and to 0.0 when evaluating with LLMs. Furthermore, for a fair comparison, we set the same budget for all methods. Regarding demo assignments in MoP, we use the default hyperparameters consistently across all experiments. ... Regarding hyperparameters, the α value in Equation (10) is set to the default value of 0.02 and remains the same across all experiments. (A config sketch follows the table.)
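
The Research Type row above summarizes MoP's core mechanism: demos are clustered in an embedding space, and each test query is routed to the expert whose region it falls in. The following is a minimal sketch of that routing idea only, not the authors' released code; it uses scikit-learn k-means as a stand-in for the paper's demo-assignment procedure, and the embedding dimensionality and expert count are illustrative assumptions.

```python
# Hedged sketch: cluster demo embeddings into expert regions, then route a
# test query to the expert with the nearest cluster centroid. K-means is a
# stand-in assumption for the paper's actual demo-assignment procedure.
import numpy as np
from sklearn.cluster import KMeans


def build_experts(demo_embeddings: np.ndarray, n_experts: int = 4) -> KMeans:
    """Partition demos into regions; each region later receives its own instruction."""
    km = KMeans(n_clusters=n_experts, n_init=10, random_state=0)
    km.fit(demo_embeddings)
    return km  # km.cluster_centers_ define the experts' regions


def route_to_expert(km: KMeans, query_embedding: np.ndarray) -> int:
    """Assign a test query to the expert whose centroid is closest in embedding space."""
    dists = np.linalg.norm(km.cluster_centers_ - query_embedding, axis=1)
    return int(np.argmin(dists))
```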
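
The Dataset Splits row quotes the paper's description of selecting instructions by their rank on a validation set partitioned from the training demos. Below is a hedged sketch of that selection step under stated assumptions: `score_fn` is a hypothetical callable (not from the paper) that queries the task LLM with a candidate instruction and returns its accuracy on the given demos.

```python
# Hedged sketch of validation-based instruction selection: split the demos,
# score each candidate instruction on the held-out portion, keep the best.
# `score_fn` is a hypothetical scoring callable, assumed here for illustration.
import random
from typing import Callable, List, Tuple

Demo = Tuple[str, str]  # (input, output) pair


def split_demos(demos: List[Demo], val_frac: float = 0.2, seed: int = 0):
    """Partition demos into train/validation, mirroring a supervised-learning split."""
    rng = random.Random(seed)
    shuffled = demos[:]
    rng.shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_frac))
    return shuffled[n_val:], shuffled[:n_val]


def select_instruction(candidates: List[str],
                       val_demos: List[Demo],
                       score_fn: Callable[[str, List[Demo]], float]) -> str:
    """Rank candidate instructions by their empirical score on the validation demos."""
    return max(candidates, key=lambda instruction: score_fn(instruction, val_demos))
```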
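
Finally, the Experiment Setup row reports the decoding and search settings. The snippet below only restates those reported values in config form; the key names are assumptions for illustration, and only the values (generation temperature 0.9, evaluation temperature 0.0, α = 0.02) come from the quoted text.

```python
# Illustrative config sketch; key names are assumed, values are as quoted above.
EXPERIMENT_CONFIG = {
    "instruction_generation": {"temperature": 0.9},  # encourage diverse instruction proposals
    "evaluation": {"temperature": 0.0},              # deterministic scoring of instructions
    "alpha": 0.02,                                   # Equation (10) hyperparameter, fixed across experiments
}
```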