Efficient multi-prompt evaluation of LLMs

Authors: Felipe Maia Polo, Ronald Xu, Lucas Weber, Mírian Silva, Onkar Bhardwaj, Leshem Choshen, Allysson de Oliveira, Yuekai Sun, Mikhail Yurochkin

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We prove that PromptEval consistently estimates the performance distribution and demonstrate its efficacy empirically on three prominent LLM benchmarks: MMLU, BIG-bench Hard, and LMentry; for example, PromptEval can accurately estimate performance quantiles across 100 prompt templates on MMLU with a budget equivalent to two single-prompt evaluations.
Researcher Affiliation | Collaboration | 1 University of Michigan, 2 MIT, 3 Universitat Pompeu Fabra, 4 Federal University of Minas Gerais, 5 IBM Research, 6 MIT-IBM Watson AI Lab
Pseudocode | Yes | Algorithm 1: PromptEval
Open Source Code | Yes | Our code can be found in https://github.com/felipemaiapolo/prompteval
Open Datasets | Yes | We use data derived from three popular benchmarks: MMLU [Hendrycks et al., 2020], BIG-bench Hard (BBH) [Suzgun et al., 2022], and LMentry [Efrat et al., 2022]. (...) and the MMLU data can be found in https://huggingface.co/PromptEval.
Dataset Splits | Yes | Additionally, the training data are split along the example axis into an 80% training and 20% validation set. (See the split sketch after this table.)
Hardware Specification | Yes | All experiments were conducted using a virtual machine with 32 cores. (...) we employ multiple NVIDIA A30 GPUs with 24 GB VRAM
Software Dependencies | No | The paper mentions software such as the Adam optimizer, BERT, and sentence transformers but does not specify their version numbers, which are required for a reproducible description of ancillary software.
Experiment Setup | Yes | We employ the Adam optimizer [Kingma and Ba, 2014] with an initial learning rate of 2e-5 and a weight decay of 1e-5. The learning rate undergoes a linear warm-up over 200 steps, followed by exponential decay using the formula lr_current = γ^s · lr_init, where s is the number of steps after the warm-up phase and the decay factor γ is set to 0.99995. We train with a batch size of 96. (See the scheduler sketch after this table.)
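
The learning-rate schedule quoted in the "Experiment Setup" row is fully specified by its hyperparameters (2e-5 initial rate, 200-step linear warm-up, per-step decay factor 0.99995), so it can be sketched independently of the released code. Below is a minimal reconstruction using PyTorch's LambdaLR; the Linear model is only a placeholder (the paper fine-tunes BERT-style predictors), and the exact warm-up shape is an assumption.

```python
# Sketch of the reported schedule: Adam (lr 2e-5, weight decay 1e-5),
# 200-step linear warm-up, then exponential decay lr = gamma**s * lr_init,
# where s counts steps after the warm-up phase. Model is a stand-in.
import torch

model = torch.nn.Linear(768, 1)  # placeholder for the actual predictor
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5, weight_decay=1e-5)

WARMUP_STEPS = 200
GAMMA = 0.99995

def lr_lambda(step: int) -> float:
    """Multiplicative factor applied to the initial learning rate."""
    if step < WARMUP_STEPS:
        return (step + 1) / WARMUP_STEPS       # linear warm-up (assumed ramp shape)
    return GAMMA ** (step - WARMUP_STEPS)      # exponential decay after warm-up

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# Usage: call optimizer.step() and then scheduler.step() once per batch (batch size 96).
```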
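
For the 80%/20% split noted in the "Dataset Splits" row, the split is taken along the example axis, so every prompt template appears in both partitions and only the evaluation examples are divided. A minimal sketch, assuming the correctness scores are stored as a templates × examples binary matrix (the matrix below is random placeholder data, not the benchmark results):

```python
# 80/20 train/validation split along the example (column) axis.
import numpy as np

rng = np.random.default_rng(0)
Y = rng.integers(0, 2, size=(100, 500))          # placeholder correctness matrix

n_examples = Y.shape[1]
perm = rng.permutation(n_examples)
cut = int(0.8 * n_examples)
train_idx, val_idx = perm[:cut], perm[cut:]

Y_train, Y_val = Y[:, train_idx], Y[:, val_idx]  # all templates kept; examples split
```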