Efficient multi-prompt evaluation of LLMs

Authors: Felipe Maia Polo, Ronald Xu, Lucas Weber, Mírian Silva, Onkar Bhardwaj, Leshem Choshen, Allysson de Oliveira, Yuekai Sun, Mikhail Yurochkin

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We prove that PromptEval consistently estimates the performance distribution and demonstrate its efficacy empirically on three prominent LLM benchmarks: MMLU, BIG-bench Hard, and LMentry; for example, PromptEval can accurately estimate performance quantiles across 100 prompt templates on MMLU with a budget equivalent to two single-prompt evaluations.
Researcher Affiliation | Collaboration | 1 University of Michigan, 2 MIT, 3 Universitat Pompeu Fabra, 4 Federal University of Minas Gerais, 5 IBM Research, 6 MIT-IBM Watson AI Lab
Pseudocode | Yes | Algorithm 1: PromptEval
Open Source Code | Yes | Our code can be found in https://github.com/felipemaiapolo/prompteval
Open Datasets | Yes | We use data derived from three popular benchmarks: MMLU [Hendrycks et al., 2020], BIG-bench Hard (BBH) [Suzgun et al., 2022], and LMentry [Efrat et al., 2022]. (...) and the MMLU data can be found in https://huggingface.co/PromptEval.
Dataset Splits | Yes | Additionally, the training data are split along the example axis into an 80% training and 20% validation set. (See the split sketch after this table.)
Hardware Specification | Yes | All experiments were conducted using a virtual machine with 32 cores. (...) we employ multiple NVIDIA A30 GPUs with 24 GB VRAM
Software Dependencies | No | The paper mentions software such as the Adam optimizer, BERT, and sentence transformers but does not specify their version numbers, which are required for a reproducible description of ancillary software.
Experiment Setup | Yes | We employ the Adam optimizer [Kingma and Ba, 2014] with an initial learning rate of 2e-5 and a weight decay of 1e-5. The learning rate undergoes a linear warm-up over 200 steps, followed by exponential decay using the formula lr_current = γ^s · lr_init, where s is the number of steps after the warm-up phase and the decay factor γ is set to 0.99995. We train with a batch size of 96. (See the scheduler sketch after this table.)
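
The learning-rate schedule quoted in the "Experiment Setup" row is fully specified by its hyperparameters (2e-5 initial rate, 200-step linear warm-up, per-step decay factor 0.99995), so it can be sketched independently of the released code. Below is a minimal reconstruction using PyTorch's LambdaLR; the Linear model is only a placeholder (the paper fine-tunes BERT-style predictors), and the exact warm-up shape is an assumption.

```python
# Sketch of the reported schedule: Adam (lr 2e-5, weight decay 1e-5),
# 200-step linear warm-up, then exponential decay lr = gamma**s * lr_init,
# where s counts steps after the warm-up phase. Model is a stand-in.
import torch

model = torch.nn.Linear(768, 1)  # placeholder for the actual predictor
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5, weight_decay=1e-5)

WARMUP_STEPS = 200
GAMMA = 0.99995

def lr_lambda(step: int) -> float:
    """Multiplicative factor applied to the initial learning rate."""
    if step < WARMUP_STEPS:
        return (step + 1) / WARMUP_STEPS       # linear warm-up (assumed ramp shape)
    return GAMMA ** (step - WARMUP_STEPS)      # exponential decay after warm-up

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# Usage: call optimizer.step() and then scheduler.step() once per batch (batch size 96).
```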
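
For the 80%/20% split noted in the "Dataset Splits" row, the split is taken along the example axis, so every prompt template appears in both partitions and only the evaluation examples are divided. A minimal sketch, assuming the correctness scores are stored as a templates × examples binary matrix (the matrix below is random placeholder data, not the benchmark results):

```python
# 80/20 train/validation split along the example (column) axis.
import numpy as np

rng = np.random.default_rng(0)
Y = rng.integers(0, 2, size=(100, 500))          # placeholder correctness matrix

n_examples = Y.shape[1]
perm = rng.permutation(n_examples)
cut = int(0.8 * n_examples)
train_idx, val_idx = perm[:cut], perm[cut:]

Y_train, Y_val = Y[:, train_idx], Y[:, val_idx]  # all templates kept; examples split
```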