Rethinking Generative Large Language Model Evaluation for Semantic Comprehension
Authors: Fangyun Wei, Xi Chen, Lin Luo
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through a comprehensive evaluation of 24 models across 11 benchmarks, we highlight several potential drawbacks of MCQA, for instance, the inconsistency between the MCQA evaluation and the generation of open-ended responses in practical scenarios. In response, we introduce an RWQ-Elo rating system, engaging 24 LLMs such as GPT-4, GPT-3.5, Google-Gemini-Pro and LLaMA-1/-2, in a two-player competitive format, with GPT-4 serving as the judge. Each LLM receives an Elo rating thereafter. |
| Researcher Affiliation | Industry | Fangyun Wei*1, Xi Chen*1, Lin Luo*1; 1Microsoft Research Asia. Correspondence to: Fangyun Wei <fawe@microsoft.com>. |
| Pseudocode | No | The paper describes the RWQ-Elo Rating Algorithm in text but does not include a formally labeled pseudocode block or algorithm box. |
| Open Source Code | No | The abstract lists a "Project page: https://luolinrowling.github.io/RethinkLLM-Eval", but the paper does not explicitly state that the source code for the methodology is available at this link, and it contains no unambiguous statement of a code release. |
| Open Datasets | No | This system is designed to mirror real-world usage, and for this purpose, we have compiled a new benchmark called Real-world questions (RWQ), comprising 20,772 authentic user inquiries. This dataset comprises 20,772 authentic questions sourced from various platforms such as Google Trends, Quora, ShareGPT, LMSYS-Chat-1M, and AlpacaEval (Li et al., 2023a). While the RWQ dataset is compiled from publicly accessible platforms, the paper does not provide a direct URL, DOI, or specific repository name for the compiled RWQ dataset itself. |
| Dataset Splits | No | The paper describes random sampling of questions from the RWQ database for competitive rounds: "During each competition round, we randomly pair two LLMs (referred to as LLM-A and LLM-B) and present them with a question sampled from our RWQ database." However, it does not specify explicit training/validation/test splits or per-split sample counts needed for reproducibility. |
| Hardware Specification | Yes | Response Generation. In each contest, the two participating LLMs generate responses to the question. We record the responses from all contests. This process takes 30 hours on 8 Nvidia A100 (80G) GPUs. |
| Software Dependencies | Yes | As of March 28th, 2024, the cost for using GPT-4-Turbo-1106-preview is $0.01 per 1000 input tokens. |
| Experiment Setup | Yes | Our RWQ-Elo system requires calling GPT-4 N(N-1)H/2 times, where N = 24 and H = 200 are the default values. For each call, we feed the combination of the evaluation prompt, the question, and the responses generated by two LLMs into GPT-4. The average length of each GPT-4 input is 1088 tokens. ... K represents the K-factor, which is set to 4 by default. ... In our implementation, we set C to 100. (See the sketches following this table.) |
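
The Experiment Setup row quotes the K-factor (K = 4) and the scale constant (C = 100) used by the RWQ-Elo rating system, but not the update rule itself. The sketch below assumes the standard logistic Elo update with those constants; the initial rating of 1000 and the win/tie/loss scoring of 1.0/0.5/0.0 are assumptions made for illustration and may differ from the paper's exact implementation.

```python
def expected_score(r_a: float, r_b: float, c: float = 100.0) -> float:
    """Expected score of LLM-A against LLM-B under a logistic Elo model (C = 100 as quoted)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / c))

def update_elo(r_a: float, r_b: float, s_a: float, k: float = 4.0, c: float = 100.0):
    """Update both ratings after one GPT-4-judged contest.

    s_a is the observed score for LLM-A: 1.0 for a win, 0.5 for a tie, 0.0 for a loss
    (assumed scoring convention).
    """
    e_a = expected_score(r_a, r_b, c)
    s_b, e_b = 1.0 - s_a, 1.0 - e_a
    return r_a + k * (s_a - e_a), r_b + k * (s_b - e_b)

# Example: two LLMs start from an assumed rating of 1000; LLM-A wins one contest.
r_a, r_b = 1000.0, 1000.0
r_a, r_b = update_elo(r_a, r_b, s_a=1.0)
print(r_a, r_b)  # 1002.0 998.0
```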
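The quoted figures also imply a concrete GPT-4 judging budget: N(N-1)H/2 calls with N = 24 and H = 200, an average input of 1088 tokens per call, and $0.01 per 1000 input tokens for GPT-4-Turbo-1106-preview. The arithmetic below is only an input-token estimate under those quoted numbers; output-token costs and retries are not included, so the paper's own total may differ.

```python
# Rough input-cost estimate implied by the figures quoted in the table above.
N = 24                      # number of LLMs in the RWQ-Elo pool
H = 200                     # default number of rounds
avg_input_tokens = 1088     # average GPT-4 input length per judging call
usd_per_1k_input = 0.01     # GPT-4-Turbo-1106-preview input price (as of March 28th, 2024)

judging_calls = N * (N - 1) * H // 2                   # 55,200 pairwise judgments
total_input_tokens = judging_calls * avg_input_tokens  # ~60.1M input tokens
input_cost_usd = total_input_tokens / 1000 * usd_per_1k_input

print(f"{judging_calls} calls, ~${input_cost_usd:,.0f} in input tokens alone")
# -> 55200 calls, ~$601 in input tokens alone
```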