Efficient multi-prompt evaluation of LLMs
Authors: Felipe Maia Polo, Ronald Xu, Lucas Weber, Mírian Silva, Onkar Bhardwaj, Leshem Choshen, Allysson de Oliveira, Yuekai Sun, Mikhail Yurochkin
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We prove that PromptEval consistently estimates the performance distribution and demonstrate its efficacy empirically on three prominent LLM benchmarks: MMLU, BIG-bench Hard, and LMentry; for example, PromptEval can accurately estimate performance quantiles across 100 prompt templates on MMLU with a budget equivalent to two single-prompt evaluations. |
| Researcher Affiliation | Collaboration | 1University of Michigan, 2MIT, 3Universitat Pompeu Fabra, 4Federal University of Minas Gerais, 5IBM Research, 6MIT-IBM Watson AI Lab |
| Pseudocode | Yes | Algorithm 1: PromptEval |
| Open Source Code | Yes | Our code can be found in https://github.com/felipemaiapolo/prompteval |
| Open Datasets | Yes | We use data derived from three popular benchmarks: MMLU [Hendrycks et al., 2020], BIG-bench Hard (BBH) [Suzgun et al., 2022], and LMentry [Efrat et al., 2022]. (...) and the MMLU data can be found in https://huggingface.co/PromptEval. |
| Dataset Splits | Yes | Additionally, the training data are split along the example axis into an 80% training and 20% validation set. (A split sketch follows the table.) |
| Hardware Specification | Yes | All experiments were conducted using a virtual machine with 32 cores. (...) we employ multiple NVIDIA A30 GPUs with 24 GB VRAM |
| Software Dependencies | No | The paper mentions software such as the Adam optimizer, BERT, and sentence transformers but does not specify version numbers, which a reproducible description of ancillary software requires. |
| Experiment Setup | Yes | We employ the Adam optimizer [Kingma and Ba, 2014] with an initial learning rate of 2e-5 and a weight decay of 1e-5. The learning rate undergoes a linear warm-up over 200 steps, followed by exponential decay using the formula lr_current = γ^s · lr_init, where s is the number of steps after the warm-up phase and the decay factor γ is set to 0.99995. We train with a batch size of 96. (This schedule is sketched in code after the table.) |
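
The training configuration quoted in the Experiment Setup row can only be approximated from the text. The sketch below is not the authors' code: the model, the loop skeleton, and the `train_loader`/`compute_loss` names are placeholders, and only the hyperparameters (Adam, lr 2e-5, weight decay 1e-5, 200 warm-up steps, γ = 0.99995, batch size 96) come from the paper.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

# Hyperparameters quoted in the paper; everything else below is a placeholder.
LR_INIT = 2e-5
WEIGHT_DECAY = 1e-5
WARMUP_STEPS = 200
GAMMA = 0.99995
BATCH_SIZE = 96  # reported batch size

# Stand-in for the fine-tuned predictor (the paper fine-tunes a BERT-style model).
model = torch.nn.Linear(768, 1)

optimizer = Adam(model.parameters(), lr=LR_INIT, weight_decay=WEIGHT_DECAY)

def lr_multiplier(step: int) -> float:
    """Linear warm-up over WARMUP_STEPS steps, then lr_current = gamma^s * lr_init."""
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS         # ramps the lr linearly up to LR_INIT
    return GAMMA ** (step - WARMUP_STEPS)  # exponential decay after warm-up

scheduler = LambdaLR(optimizer, lr_lambda=lr_multiplier)

# Skeleton of the update loop; `train_loader` and `compute_loss` are hypothetical.
# for batch in train_loader:        # batches of size BATCH_SIZE
#     loss = compute_loss(model, batch)
#     loss.backward()
#     optimizer.step()
#     scheduler.step()
#     optimizer.zero_grad()
```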
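
Similarly, for the Dataset Splits row, a minimal sketch of an 80%/20% split along the example axis is given below, assuming the collected results form a (templates × examples) correctness matrix; the matrix shape, variable names, and seed are illustrative and not taken from the paper.

```python
import numpy as np

# Illustrative correctness matrix: rows = prompt templates, columns = examples.
rng = np.random.default_rng(0)
Y = rng.integers(0, 2, size=(100, 500))  # placeholder shape, not from the paper

# Split along the example axis (columns): 80% training / 20% validation.
n_examples = Y.shape[1]
perm = rng.permutation(n_examples)
n_train = int(0.8 * n_examples)
train_idx, val_idx = perm[:n_train], perm[n_train:]

Y_train = Y[:, train_idx]  # all templates, 80% of examples
Y_val = Y[:, val_idx]      # all templates, held-out 20% of examples
```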