Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Efficient Prompt Optimization Through the Lens of Best Arm Identification
Authors: Chengshuai Shi, Kun Yang, Zihan Chen, Jundong Li, Jing Yang, Cong Shen
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on multiple well-adopted tasks using various LLMs demonstrate the remarkable performance improvement of TRIPLE over baselines while satisfying the limited budget constraints. |
| Researcher Affiliation | Academia | Chengshuai Shi University of Virginia EMAIL Kun Yang University of Virginia EMAIL Zihan Chen University of Virginia EMAIL Jundong Li University of Virginia EMAIL Jing Yang The Pennsylvania State University EMAIL Cong Shen University of Virginia EMAIL |
| Pseudocode | Yes | Their complete descriptions are provided in Algs. 2 and 3 of Appendix C. Algorithm 1 TRIPLE-CLST... Algorithm 3 TRIPLE-CR... Algorithm 4 TRIPLE-GSE... Algorithm 5 TRIPLE-CSAR... Algorithm 6 TRIPLE-SAR |
| Open Source Code | Yes | The experimental codes can be found at https://github.com/ShenGroup/TRIPLE. |
| Open Datasets | Yes | Extensive experimental results are reported to evaluate the efficiency of TRIPLE across diverse prompting tasks from two standard datasets: Instruction-Induction [30] and Big Bench [69]. |
| Dataset Splits | Yes | Furthermore, to avoid overfitting and convergence issues, we adopt the standard approach by dividing our interaction data into training (80%) and validation (20%) sets. |
| Hardware Specification | Yes | We use a workstation with two Nvidia-A6000 Ada GPUs for all experiments using white-box LLMs (i.e., Llama2, Mistral, and Gemma). |
| Software Dependencies | No | The paper mentions specific LLM models (GPT-3.5: gpt-3.5-turbo-1106, Llama2: Llama2-7b, Gemma: Gemma-7b, Mistral: Mistral-7B-v0.2) and OpenAI components (cl100k_base tokenizer, text-embedding-ada-002 model). While these are specific tools, the paper does not list broader software dependencies with explicit version numbers (e.g., Python version, PyTorch/TensorFlow version, CUDA version, or other general libraries) that would be needed to replicate the entire experimental environment. |
| Experiment Setup | Yes | In experiments with TRIPLE-CLST, the number of clusters is set as L = √|P| and a third of our total budget is allocated for the initial phase, i.e., N1 = N/3... For the APO framework... we set {num_feedback} to 2 and {num_prompts} to 5... in the implementation of TRIPLE-GSE, we first employ a projection to 64 dimensions... we set this error threshold at 0.1 in our experiments. |
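The paper frames budget-limited prompt selection as fixed-budget best arm identification: each candidate prompt is an "arm," and a fixed evaluation budget is spent identifying the best one. As a rough illustration only (not the authors' TRIPLE algorithms), the sketch below uses generic successive halving, a standard fixed-budget strategy: evaluate all surviving prompts on a few examples, keep the better half, and repeat so the budget concentrates on promising prompts. The `evaluate` callable and the budget-splitting scheme here are illustrative assumptions.

```python
def successive_halving(prompts, evaluate, budget):
    """Illustrative fixed-budget best-arm identification over candidate prompts.

    `evaluate(prompt)` is assumed to return a (possibly noisy) score for one
    evaluation of the prompt; `budget` is the total number of evaluations.
    This is a generic sketch, not the paper's TRIPLE-CLST/GSE/SAR methods.
    """
    pool = list(prompts)
    # ~log2(|pool|) halving rounds, sharing the budget evenly across rounds.
    rounds = max(1, (len(pool) - 1).bit_length())
    scores = {p: [] for p in pool}
    for _ in range(rounds):
        # Split this round's share of the budget across surviving prompts.
        per_prompt = max(1, budget // (rounds * len(pool)))
        for p in pool:
            for _ in range(per_prompt):
                scores[p].append(evaluate(p))
        # Keep the better-scoring half of the pool.
        pool.sort(key=lambda p: sum(scores[p]) / len(scores[p]), reverse=True)
        pool = pool[: max(1, len(pool) // 2)]
    return pool[0]
```

With noisy scores, surviving prompts accumulate more evaluations per round as the pool shrinks, which is what makes fixed-budget identification more sample-efficient than evaluating every prompt equally.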