Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Efficient Prompt Optimization Through the Lens of Best Arm Identification
Authors: Chengshuai Shi, Kun Yang, Zihan Chen, Jundong Li, Jing Yang, Cong Shen
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on multiple well-adopted tasks using various LLMs demonstrate the remarkable performance improvement of TRIPLE over baselines while satisfying the limited budget constraints. |
| Researcher Affiliation | Academia | Chengshuai Shi University of Virginia EMAIL Kun Yang University of Virginia EMAIL Zihan Chen University of Virginia EMAIL Jundong Li University of Virginia EMAIL Jing Yang The Pennsylvania State University EMAIL Cong Shen University of Virginia EMAIL |
| Pseudocode | Yes | Their complete descriptions are provided in Algs. 2 and 3 of Appendix C. Algorithm 1 TRIPLE-CLST... Algorithm 3 TRIPLE-CR... Algorithm 4 TRIPLE-GSE... Algorithm 5 TRIPLE-CSAR... Algorithm 6 TRIPLE-SAR |
| Open Source Code | Yes | The experimental codes can be found at https://github.com/ShenGroup/TRIPLE. |
| Open Datasets | Yes | Extensive experimental results are reported to evaluate the efficiency of TRIPLE across diverse prompting tasks from two standard datasets: Instruction-Induction [30] and Big Bench [69]. |
| Dataset Splits | Yes | Furthermore, to avoid overfitting and convergence issues, we adopt the standard approach by dividing our interaction data into training (80%) and validation (20%) sets. |
| Hardware Specification | Yes | We use a workstation with two Nvidia-A6000 Ada GPUs for all experiments using white-box LLMs (i.e., Llama2, Mistral, and Gemma). |
| Software Dependencies | No | The paper mentions specific LLM models (GPT-3.5: gpt-3.5-turbo-1106, Llama2: Llama2-7b, Gemma: Gemma-7b, Mistral: Mistral-7B-v0.2) and OpenAI components (cl100k_base tokenizer, text-embedding-ada-002 model). While these are specific tools, the paper does not list broader software dependencies with explicit version numbers (e.g., Python version, PyTorch/TensorFlow version, CUDA version, or other general libraries) that would be needed to replicate the entire experimental environment. |
| Experiment Setup | Yes | In experiments with TRIPLE-CLST, the number of clusters is set as L = √|P| and a third of our total budget is allocated for the initial phase, i.e., N1 = N/3... For the APO framework... we set {num_feedback} to 2 and {num_prompts} to 5... in the implementation of TRIPLE-GSE, we first employ a projection to 64 dimensions... we set this error threshold at 0.1 in our experiments. |
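The paper frames budget-limited prompt selection as fixed-budget best arm identification: each candidate prompt is an "arm," and a fixed evaluation budget is spent identifying the best one. As a rough illustration only (not the authors' TRIPLE algorithms), the sketch below uses generic successive halving, a standard fixed-budget strategy: evaluate all surviving prompts on a few examples, keep the better half, and repeat so the budget concentrates on promising prompts. The `evaluate` callable and the budget-splitting scheme here are illustrative assumptions.

```python
def successive_halving(prompts, evaluate, budget):
    """Illustrative fixed-budget best-arm identification over candidate prompts.

    `evaluate(prompt)` is assumed to return a (possibly noisy) score for one
    evaluation of the prompt; `budget` is the total number of evaluations.
    This is a generic sketch, not the paper's TRIPLE-CLST/GSE/SAR methods.
    """
    pool = list(prompts)
    # ~log2(|pool|) halving rounds, sharing the budget evenly across rounds.
    rounds = max(1, (len(pool) - 1).bit_length())
    scores = {p: [] for p in pool}
    for _ in range(rounds):
        # Split this round's share of the budget across surviving prompts.
        per_prompt = max(1, budget // (rounds * len(pool)))
        for p in pool:
            for _ in range(per_prompt):
                scores[p].append(evaluate(p))
        # Keep the better-scoring half of the pool.
        pool.sort(key=lambda p: sum(scores[p]) / len(scores[p]), reverse=True)
        pool = pool[: max(1, len(pool) // 2)]
    return pool[0]
```

With noisy scores, surviving prompts accumulate more evaluations per round as the pool shrinks, which is what makes fixed-budget identification more sample-efficient than evaluating every prompt equally.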