ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

Authors: Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, Zhiyuan Liu

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments on two benchmarks illustrate that ChatEval delivers superior accuracy and correlation in alignment with human assessment.
Researcher Affiliation | Academia | Chi-Min Chan1, Weize Chen1, Yusheng Su1, Jianxuan Yu1, Wei Xue2, Shanghang Zhang3, Jie Fu2, Zhiyuan Liu1 — 1 Tsinghua University; 2 Hong Kong University of Science and Technology; 3 Peking University
Pseudocode | Yes | C FORMAL DEPICTION OF DIFFERENT COMMUNICATION STRATEGIES ... Algorithm 1: One-by-One ... Algorithm 2: Simultaneous-Talk ... Algorithm 3: Simultaneous-Talk-with-Summarizer (a minimal sketch of the one-by-one strategy appears after this table)
Open Source Code | No | The paper does not provide concrete access to the source code for the described methodology, such as a specific repository link or an explicit code release statement.
Open Datasets | Yes | We evaluate ChatEval on two benchmarks, FairEval and Topical-Chat ... We then take the human annotation results from Wu et al. (2023) to conduct the experiments in this paper. ... We draw upon the Topical-Chat (Gopalakrishnan et al., 2019) dataset for our study.
Dataset Splits | No | The paper uses existing benchmarks (FairEval and Topical-Chat) with human annotation results but does not specify how these datasets were split into training, validation, and test sets for the purpose of its LLM-based evaluation experiments. It focuses on evaluating correlation with human judgments on these datasets.
Hardware Specification | No | The paper mentions using OpenAI's GPT-4 and ChatGPT (GPT-3.5-turbo) models, but it does not specify any hardware details (e.g., GPU models, CPU types, or cloud instance specifications) used to run these models or their experiments.
Software Dependencies | No | The paper mentions using OpenAI's GPT-4 and ChatGPT (GPT-3.5-turbo) and open-sourced models like Llama2-Chat-7b and Vicuna-7b-v1.5, but it does not provide specific version numbers for ancillary software dependencies such as programming languages, libraries, or frameworks beyond the model names themselves.
Experiment Setup | Yes | In our current research, we focus on homogeneous groups of LLMs. That is, within a given multi-agent group, all LLMs belong to the same GPT family model, either all GPT-4 or all ChatGPT. We acknowledge the potential of heterogeneous groups for future research, which could provide fascinating insights into how strong models and weak models can cooperate in a multi-agent setting. Additionally, unlike previous work like Du et al. (2023), we do not explicitly ask the debater agents to reach a consensus at the end of the debate. In situations where the response format relies on direct comparison, we derive the final results from the majority vote among various annotators. Conversely, if the response format requires a direct score, we calculate the average score obtained from multiple annotators. This methodological approach ensures the impartiality and balance of our evaluation process. ... By default, we configure the communication strategy to one-by-one, agent numbers to 2, and discussion turns to 2 in this section and employ position calibration techniques in both single-agent and multi-agent settings. We will discuss more debate configurations in Section 4 for completeness. (sketches of the aggregation rules and a correlation computation appear after this table)
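
The pseudocode row above lists three communication strategies. Since no code is released, the following is a minimal Python sketch of the one-by-one strategy only, assuming a hypothetical query_llm(system_prompt, user_prompt) helper around a chat-completion API; the function name and prompt wording are illustrative, not the paper's implementation.

    from typing import Callable, List

    def one_by_one_debate(
        roles: List[str],          # one role prompt per debater agent
        task: str,                 # evaluation task, e.g. two responses to compare
        turns: int,                # number of discussion turns (paper default: 2)
        query_llm: Callable[[str, str], str],  # (system, user) -> reply; assumed helper
    ) -> List[str]:
        """One-by-one debate: agents speak in a fixed order, and each later
        agent is conditioned on the full transcript of earlier utterances."""
        transcript: List[str] = []
        for turn in range(turns):
            for i, role in enumerate(roles):
                history = "\n".join(transcript) or "(no prior discussion)"
                prompt = (f"{task}\n\nDebate so far:\n{history}\n\n"
                          "Give your assessment.")
                reply = query_llm(role, prompt)
                transcript.append(f"[turn {turn + 1}, agent {i + 1}] {reply}")
        return transcript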
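
The experiment-setup row states that, with no forced consensus, final verdicts come from a majority vote for comparison-format responses and from averaging for direct-score formats. A minimal sketch of those two aggregation rules (function names are illustrative, not from the paper):

    from collections import Counter
    from statistics import mean

    def aggregate_verdicts(verdicts):
        """Majority vote over comparison verdicts such as 'A', 'B', 'tie'.
        Ties between equally common verdicts are broken arbitrarily here."""
        return Counter(verdicts).most_common(1)[0][0]

    def aggregate_scores(scores):
        """Average score across annotator agents for direct-scoring formats."""
        return mean(scores)

    print(aggregate_verdicts(["A", "tie", "A"]))  # -> A
    print(aggregate_scores([4, 5, 4.5]))          # -> 4.5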
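
The paper reports "accuracy and correlation in alignment with human assessment" (Research Type row), but its evaluation scripts are not released. As an assumption, Topical-Chat-style correlations between model scores and human ratings could be computed with standard SciPy functions; the numbers below are placeholders, not results from the paper.

    from scipy.stats import spearmanr, kendalltau, pearsonr

    human = [3.0, 4.5, 2.0, 5.0, 3.5]   # placeholder human ratings
    model = [3.2, 4.0, 2.5, 4.8, 3.0]   # placeholder averaged ChatEval scores

    rho, _ = spearmanr(human, model)
    tau, _ = kendalltau(human, model)
    r, _ = pearsonr(human, model)
    print(f"Spearman={rho:.3f}  Kendall={tau:.3f}  Pearson={r:.3f}")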