Multi-LLM Debate: Framework, Principals, and Interventions

Authors: Andrew Estornell, Yang Liu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We also demonstrate that these interventions result in better performance on four common benchmark tasks. We also conduct experiments on four common benchmarks demonstrating that these interventions improve debate efficacy in practice.
Researcher Affiliation | Collaboration | Andrew Estornell, ByteDance Research, andrew.estornell@bytedance.com; Yang Liu, University of California, Santa Cruz, yangliu@ucsc.edu
Pseudocode | Yes | Algorithm 1: Application of Combined Interventions
Open Source Code | No | We cite all data used in the paper and we will release our code publicly upon publication.
Open Datasets | Yes | We conduct experiments on four common language model benchmarks: BoolQ Clark et al. [2019], which consists of 3,270 yes-no questions; MMLU Hendrycks et al. [2020], which consists of 13,869 multiple-choice questions (we use the 3,406 high-school-level questions); TruthfulQA Lin et al. [2021], which consists of 817 open-ended questions; and MathQ, which consists of 3,000 arithmetic questions of the form abc + def.
Dataset Splits | No | The paper uses standard benchmarks but does not explicitly state specific training/validation/test splits (e.g., percentages or sample counts) for its experiments. While these benchmarks often have predefined splits, the paper does not specify which ones are used or how they partitioned the data.
Hardware Specification | Yes | For all experiments, we use one Nvidia Tesla V100 GPU and one Intel 32-core CPU.
Software Dependencies | Yes | Table 2 (List of specific types of models used in experiments): GPT-3.5 (version GPT-3.5 Turbo, library openai); Llama-2 (Llama-2 7B Chat, huggingface); Llama-3 (Llama-3 8B Instruct, huggingface); Mistral (Mistral 7B Instruct v0.2, huggingface).
Experiment Setup | Yes | Table 1: Accuracy of a solo model, debate, and our debate interventions: 10 rounds, 6 models. We begin with the per-round performance of our method and SoM, as shown in Figure 3. In the BoolQ, MMLU, and MathQ datasets, model correctness is measured through regular expression matching. In the TruthfulQA dataset, model correctness is measured via an LLM judge (we use GPT-4 as the judge in all experiments).
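
The Open Datasets row quotes the paper's four benchmarks. The sketch below shows one way those datasets could be assembled; it is not the authors' code, and the Hugging Face dataset identifiers, the high-school subject filter, and the MathQ generator are assumptions that the paper does not specify.

```python
# Minimal sketch (not the authors' code) of assembling the four benchmarks quoted
# in the Open Datasets row. Dataset IDs, the subject filter, and the MathQ
# generator are assumptions; the paper does not state how it built these sets.
import random
from datasets import load_dataset  # pip install datasets

boolq = load_dataset("boolq", split="validation")                            # 3,270 yes-no questions
mmlu = load_dataset("cais/mmlu", "all", split="test")                        # MMLU multiple-choice questions (57 subjects)
truthfulqa = load_dataset("truthful_qa", "generation", split="validation")   # 817 open-ended questions

# Keep only high-school-level MMLU subjects, mirroring the paper's use of the
# high-school questions (matching on subject-name prefix is an assumption).
mmlu_hs = mmlu.filter(lambda ex: ex["subject"].startswith("high_school"))

def make_mathq(n=3_000, seed=0):
    """Generate arithmetic questions of the form abc + def (two three-digit addends)."""
    rng = random.Random(seed)
    return [
        {"question": f"What is {a} + {b}?", "answer": str(a + b)}
        for a, b in ((rng.randint(100, 999), rng.randint(100, 999)) for _ in range(n))
    ]

mathq = make_mathq()
```
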
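The Software Dependencies row lists the models and the libraries used to run them. A sketch of how those models could be instantiated through the openai and huggingface libraries is given below; the checkpoint identifiers and generation settings are assumptions, not taken from the paper.

```python
# Minimal sketch (not the authors' code) of instantiating the models listed in
# Table 2 through the openai and huggingface libraries. Checkpoint identifiers
# and generation settings are assumptions.
from openai import OpenAI
from transformers import pipeline

# GPT-3.5 Turbo via the openai client.
client = OpenAI()  # reads OPENAI_API_KEY from the environment

def gpt35_reply(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Open-weight chat models via huggingface transformers (checkpoint names assumed).
hf_checkpoints = {
    "Llama-2": "meta-llama/Llama-2-7b-chat-hf",
    "Llama-3": "meta-llama/Meta-Llama-3-8B-Instruct",
    "Mistral": "mistralai/Mistral-7B-Instruct-v0.2",
}
generators = {name: pipeline("text-generation", model=ckpt, device_map="auto")
              for name, ckpt in hf_checkpoints.items()}
```
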
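The Experiment Setup row describes two grading modes: regular-expression matching for BoolQ, MMLU, and MathQ, and a GPT-4 judge for TruthfulQA. The following sketch illustrates both; the answer pattern and the judge prompt are assumptions, not the paper's implementation.

```python
# Minimal sketch (not the authors' code) of the two grading modes in the
# Experiment Setup row: regex matching for BoolQ/MMLU/MathQ and a GPT-4 judge
# for TruthfulQA. The answer pattern and judge prompt are assumptions.
import re
from openai import OpenAI

def regex_correct(model_output: str, gold_answer: str) -> bool:
    """Extract a final answer such as 'Answer: (B)' or 'Answer: 579' and compare it to the gold label."""
    match = re.search(r"answer\s*[:\-]?\s*\(?([A-D]|\d+|yes|no)\)?", model_output, flags=re.IGNORECASE)
    return match is not None and match.group(1).strip().lower() == gold_answer.strip().lower()

def judge_correct(client: OpenAI, question: str, model_output: str, reference: str) -> bool:
    """Ask a GPT-4 judge whether an open-ended answer agrees with the reference answer."""
    verdict = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (f"Question: {question}\nReference answer: {reference}\n"
                        f"Candidate answer: {model_output}\n"
                        "Does the candidate answer agree with the reference? Reply YES or NO."),
        }],
    )
    return verdict.choices[0].message.content.strip().upper().startswith("YES")
```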