MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures
Authors: Jinjie Ni, Fuzhao Xue, Xiang Yue, Yuntian Deng, Mahir Shah, Kabir Jain, Graham Neubig, Yang You
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluating large language models (LLMs) is challenging. Traditional ground-truth-based benchmarks fail to capture the comprehensiveness and nuance of real-world queries, while LLM-as-judge benchmarks suffer from grading biases and limited query quantity. Both of them may also become contaminated over time. User-facing evaluation, such as Chatbot Arena, provides reliable signals but is costly and slow. In this work, we propose MixEval, a new paradigm for establishing efficient, gold-standard LLM evaluation by strategically mixing off-the-shelf benchmarks. It bridges (1) comprehensive and well-distributed real-world user queries and (2) efficient and fairly-graded ground-truth-based benchmarks, by matching queries mined from the web with similar queries from existing benchmarks. Based on MixEval, we further build MixEval-Hard, which offers more room for model improvement. Our benchmarks' advantages lie in (1) a 0.96 model ranking correlation with Chatbot Arena arising from the highly impartial query distribution and grading mechanism, (2) fast, cheap, and reproducible execution (6% of the time and cost of MMLU), and (3) dynamic evaluation enabled by the rapid and stable data update pipeline. We provide extensive meta-evaluation and analysis for our and existing LLM benchmarks to deepen the community's understanding of LLM evaluation and guide future research directions. (A minimal sketch of the query-matching step appears after this table.) |
| Researcher Affiliation | Collaboration | National University of Singapore, Carnegie Mellon University, Allen Institute for AI |
| Pseudocode | No | No explicit pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | We uploaded the code and data, and will release and periodically update the benchmark data and its related code. |
| Open Datasets | Yes | General-domain benchmarks: MMLU [17], BoolQ [9], HellaSwag [35], ARC [10], CommonsenseQA [29], AGIEval [40], OpenBookQA [21], GPQA [24], WinoGrande [25], TriviaQA [19], DROP [14], and BBH [28]. Domain-specific benchmarks: Math: GSM8K [24] and MATH [18]; Coding: MBPP [1] and HumanEval [5]; Physics: PIQA [4]; and Social Interactions: SIQA [26]. |
| Dataset Splits | No | The paper states it selected 'development and test splits' from existing benchmarks, but it does not specify exact percentages, sample counts, or detailed methodology for train/validation splits used for its own experiments or for the construction of Mix Eval beyond using existing benchmark data. |
| Hardware Specification | Yes | Models are evaluated on 4 or 8 A100 GPUs. |
| Software Dependencies | No | The paper mentions software like 'Transformers library', 'Sentence Transformers', and specific model versions like 'GPT-3.5-Turbo-0125', but it does not provide specific version numbers for software dependencies (e.g., 'Transformers v4.x.x', 'Python 3.x'). |
| Experiment Setup | Yes | Chat models employ official chat templates or FastChat chat templates [39], and base models are evaluated in a 5-shot setting. Both MixEval and MixEval-Hard, comprising samples from various benchmarks, demonstrate the inadequacies of traditional rule-based parsing methods across all benchmarks and models. To improve parsing accuracy, we use GPT-3.5-Turbo-0125 as the model parser to either score the response (free-form questions) or extract the model's choice (multiple-choice problems). (A sketch of such a parser call appears after this table.) |
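
The abstract quoted in the Research Type row describes MixEval's core construction step: matching user queries mined from the web against similar queries in existing ground-truth benchmarks. The sketch below illustrates that kind of matching with an off-the-shelf sentence-embedding model; the encoder name (`all-MiniLM-L6-v2`), the cosine-similarity threshold, and the top-1 retrieval strategy are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch of matching web-mined queries to benchmark queries by
# embedding similarity. Assumptions (not from the paper): the embedding
# model, the similarity threshold, and top-1 retrieval.
from sentence_transformers import SentenceTransformer, util

web_queries = [
    "how do I reverse a linked list in python",
    "what causes the northern lights",
]
benchmark_pool = [  # (question, source benchmark)
    ("Reverse a singly linked list.", "HumanEval"),
    ("What phenomenon produces the aurora borealis?", "TriviaQA"),
    ("Which gas is most abundant in Earth's atmosphere?", "MMLU"),
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder
web_emb = model.encode(web_queries, convert_to_tensor=True, normalize_embeddings=True)
pool_emb = model.encode([q for q, _ in benchmark_pool], convert_to_tensor=True,
                        normalize_embeddings=True)

scores = util.cos_sim(web_emb, pool_emb)  # shape: [num_web_queries, num_pool_queries]
THRESHOLD = 0.5  # assumed cutoff for accepting a match
for i, query in enumerate(web_queries):
    best = int(scores[i].argmax())
    sim = float(scores[i][best])
    if sim >= THRESHOLD:
        question, source = benchmark_pool[best]
        print(f"{query!r} -> {source}: {question!r} (sim={sim:.2f})")
```

Each accepted match inherits the benchmark question's ground-truth answer, which is what lets the mixture stay objectively gradable while following a real-world query distribution.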
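
The Experiment Setup row states that GPT-3.5-Turbo-0125 serves as a model parser, either scoring free-form responses or extracting the chosen option for multiple-choice problems. Below is a minimal sketch of such a parser call using the OpenAI Python client; the prompt wording and the single-letter output format are assumptions, not the paper's exact parser prompt.

```python
# Minimal sketch of an LLM-based answer parser for multiple-choice items.
# Assumptions (not from the paper): the exact prompt wording and the
# single-letter output format; the parser model name follows the paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def parse_choice(question: str, options: list[str], model_response: str) -> str:
    """Ask the parser model which option the evaluated model's response selects."""
    option_block = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    prompt = (
        "Given a multiple-choice question and a model's free-form response, "
        "reply with only the letter of the option the response selects.\n\n"
        f"Question: {question}\n{option_block}\n\n"
        f"Model response: {model_response}\n\nAnswer letter:"
    )
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=1,
    )
    return completion.choices[0].message.content.strip()

# Example: the parser maps a verbose response back to an option letter.
letter = parse_choice(
    "Which planet is known as the Red Planet?",
    ["Venus", "Mars", "Jupiter"],
    "The answer is Mars, due to iron oxide on its surface.",
)
print(letter)  # expected: "B"
```

This replaces brittle rule-based parsing (regexes over free-form text), which the paper reports is inadequate across the mixed benchmarks and models it evaluates.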