Efficient Prompt Optimization Through the Lens of Best Arm Identification
Authors: Chengshuai Shi, Kun Yang, Zihan Chen, Jundong Li, Jing Yang, Cong Shen
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on multiple well-adopted tasks using various LLMs demonstrate the remarkable performance improvement of TRIPLE over baselines while satisfying the limited budget constraints. |
| Researcher Affiliation | Academia | Chengshuai Shi (University of Virginia, cs7ync@virginia.edu); Kun Yang (University of Virginia, ky9tc@virginia.edu); Zihan Chen (University of Virginia, brf3rx@virginia.edu); Jundong Li (University of Virginia, jundong@virginia.edu); Jing Yang (The Pennsylvania State University, yangjing@psu.edu); Cong Shen (University of Virginia, cong@virginia.edu) |
| Pseudocode | Yes | Their complete descriptions are provided in Algs. 2 and 3 of Appendix C. Algorithm 1 TRIPLE-CLST... Algorithm 3 TRIPLE-CR... Algorithm 4 TRIPLE-GSE... Algorithm 5 TRIPLE-CSAR... Algorithm 6 TRIPLE-SAR |
| Open Source Code | Yes | The experimental codes can be found at https://github.com/ShenGroup/TRIPLE. |
| Open Datasets | Yes | Extensive experimental results are reported to evaluate the efficiency of TRIPLE across diverse prompting tasks from two standard datasets: Instruction-Induction [30] and Big Bench [69]. |
| Dataset Splits | Yes | Furthermore, to avoid overfitting and convergence issues, we adopt the standard approach by dividing our interaction data into training (80%) and validation (20%) sets. (A minimal split sketch is given below the table.) |
| Hardware Specification | Yes | We use a workstation with two Nvidia-A6000 Ada GPUs for all experiments using white-box LLMs (i.e., Llama2, Mistral, and Gemma). |
| Software Dependencies | No | The paper mentions specific LLM models (GPT-3.5: gpt-3.5-turbo-1106, Llama2: Llama2-7b, Gemma: Gemma-7b, Mistral: Mistral-7B-v0.2) and OpenAI components (cl100k_base tokenizer, text-embedding-ada-002 model). While these are specific tools, the paper does not list broader software dependencies with explicit version numbers (e.g., Python version, PyTorch/TensorFlow version, CUDA version, or other general libraries) that would be needed to replicate the entire experimental environment. |
| Experiment Setup | Yes | In experiments with TRIPLE-CLST, the number of clusters is set as L = √|P| and a third of our total budget is allocated for the initial phase, i.e., N₁ = N/3... For the APO framework... we set {num_feedback} to 2 and {num_prompts} to 5... in the implementation of TRIPLE-GSE, we first employ a projection to 64 dimensions... we set this error threshold at 0.1 in our experiments. (A hyperparameter sketch is given below the table.) |
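
The Dataset Splits row quotes an 80/20 train/validation split of the collected interaction data. Below is a minimal sketch of such a split; the function name and record format are illustrative assumptions, not taken from the TRIPLE codebase.

```python
import random

def split_interaction_data(records, train_frac=0.8, seed=0):
    """Shuffle collected interaction data and split it into
    training (80%) and validation (20%) sets, as described in the paper."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]
```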
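The Experiment Setup row fixes several hyperparameters: L = √|P| clusters and an initial-phase budget of N₁ = N/3 for TRIPLE-CLST, plus a 64-dimensional embedding projection for TRIPLE-GSE. A hypothetical helper reproducing those choices might look as follows; all names are assumptions, and the random Gaussian projection is only one plausible choice, since the excerpt does not specify the projection method.

```python
import math
import numpy as np

def triple_clst_hyperparams(num_prompts: int, total_budget: int):
    """Hyperparameter choices quoted in the Experiment Setup row."""
    num_clusters = max(1, round(math.sqrt(num_prompts)))  # L = sqrt(|P|)
    phase1_budget = total_budget // 3                     # N1 = N/3
    return num_clusters, phase1_budget

def project_embeddings(emb: np.ndarray, dim: int = 64, seed: int = 0) -> np.ndarray:
    """Project prompt embeddings (e.g., 1536-d text-embedding-ada-002
    vectors) down to 64 dimensions for TRIPLE-GSE. A random Gaussian
    projection is assumed here; the paper excerpt does not name the method."""
    rng = np.random.default_rng(seed)
    proj = rng.normal(size=(emb.shape[1], dim)) / math.sqrt(dim)
    return emb @ proj
```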