Multi-LLM Debate: Framework, Principals, and Interventions
Authors: Andrew Estornell, Yang Liu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also demonstrate that these interventions result in better performance on four common benchmark tasks. We also conduct experiments on four common benchmarks demonstrating that these interventions improve debate efficacy in practice. |
| Researcher Affiliation | Collaboration | Andrew Estornell Byte Dance Research andrew.estornell@bytedance.com Yang Liu University of California, Santa Cruz yangliu@ucsc.edu |
| Pseudocode | Yes | Algorithm 1 Application of Combined Interventions |
| Open Source Code | No | We cite all data used in the paper and we will release our code publicly upon publication. |
| Open Datasets | Yes | We conduct experiments on four common language model benchmarks: BoolQ Clark et al. [2019], which consists of 3,270 yes-no questions; MMLU Hendrycks et al. [2020], which consists of 13,869 multiple-choice questions (we use the 3,406 high-school-level questions); TruthfulQA Lin et al. [2021], which consists of 817 open-ended questions; and MathQ, which consists of 3,000 arithmetic questions of the form abc + def (see the generator sketch after the table). |
| Dataset Splits | No | The paper uses standard benchmarks but does not explicitly state specific training/validation/test splits (e.g., percentages or sample counts) for its experiments. While these benchmarks often have predefined splits, the paper does not specify which ones are used or how they partitioned the data. |
| Hardware Specification | Yes | For all experiments, we use one Nvidia Tesla V100 GPU and one Intel 32-core CPU. |
| Software Dependencies | Yes | Table 2 (models used in experiments): GPT-3.5 = GPT-3.5 Turbo (library: openai); Llama-2 = Llama-2 7B Chat (library: huggingface); Llama-3 = Llama-3 8B Instruct (library: huggingface); Mistral = Mistral 7B Instruct v0.2 (library: huggingface). See the model-access sketch after the table. |
| Experiment Setup | Yes | Table 1: Accuracy of a solo model, debate, and our debate interventions: 10 rounds, 6 models. We begin with the per-round performance of our method and SoM, as shown in Figure 3. In the BoolQ, MMLU, and MathQ datasets, model correctness is measured through regular-expression matching. In the TruthfulQA dataset, model correctness is measured via an LLM judge (we use GPT-4 as the judge in all experiments). See the evaluation sketch after the table. |
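
The MathQ benchmark quoted in the Open Datasets row is procedurally generated arithmetic: each item is the sum of two three-digit numbers ("abc + def"). The paper does not give the exact prompt wording, so the following is a minimal Python sketch under that assumption; `make_mathq_item` and its question text are hypothetical.

```python
import random

def make_mathq_item(rng: random.Random) -> tuple[str, int]:
    """One MathQ-style item: the sum of two three-digit numbers,
    matching the 'abc + def' form described in the paper."""
    a, b = rng.randint(100, 999), rng.randint(100, 999)
    return f"What is {a} + {b}? Answer with a single number.", a + b

rng = random.Random(0)
mathq = [make_mathq_item(rng) for _ in range(3000)]  # the paper reports 3,000 such questions
```

BoolQ, MMLU, and TruthfulQA are standard public benchmarks; the excerpt does not say how the authors loaded or filtered them beyond the counts quoted above.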
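The Software Dependencies row names the model versions and the client libraries (openai, huggingface) but not the exact checkpoint identifiers or generation settings. The sketch below shows one plausible way to access the Table 2 models; the Hugging Face hub IDs, `max_new_tokens`, and the helper names `ask_hf` / `ask_gpt35` are assumptions, not details from the paper.

```python
from openai import OpenAI
from transformers import pipeline

# Assumed Hugging Face hub IDs for the Table 2 model versions
# (the Llama checkpoints are gated and require an authenticated HF account).
HF_MODELS = {
    "Llama-2": "meta-llama/Llama-2-7b-chat-hf",        # Llama-2 7B Chat
    "Llama-3": "meta-llama/Meta-Llama-3-8B-Instruct",  # Llama-3 8B Instruct
    "Mistral": "mistralai/Mistral-7B-Instruct-v0.2",   # Mistral 7B Instruct v0.2
}
_pipelines = {}  # cache one text-generation pipeline per model

def ask_hf(model_key: str, prompt: str) -> str:
    """Send a single user turn to one of the Hugging Face chat models."""
    if model_key not in _pipelines:
        _pipelines[model_key] = pipeline(
            "text-generation", model=HF_MODELS[model_key], device_map="auto"
        )
    out = _pipelines[model_key](
        [{"role": "user", "content": prompt}], max_new_tokens=256
    )
    return out[0]["generated_text"][-1]["content"]  # last chat turn is the model's reply

def ask_gpt35(prompt: str) -> str:
    """Send a single user turn to GPT-3.5 Turbo via the openai client."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```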
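The Experiment Setup row says correctness is scored by regular-expression matching on BoolQ, MMLU, and MathQ, and by a GPT-4 judge on TruthfulQA. Neither the regex pattern nor the judge prompt is given in the paper excerpt, so both are assumptions in this sketch; `regex_correct` and `judge_correct` are hypothetical helper names.

```python
import re
from openai import OpenAI

def regex_correct(model_output: str, gold: str) -> bool:
    """Regular-expression check (BoolQ, MMLU, MathQ): does the gold answer
    appear as a standalone token in the reply? The pattern is an assumption."""
    cleaned = model_output.replace(",", "")  # tolerate '1,393'-style formatting
    return re.search(rf"\b{re.escape(str(gold))}\b", cleaned) is not None

def judge_correct(question: str, answer: str, judge_model: str = "gpt-4") -> bool:
    """TruthfulQA check via an LLM judge (the paper uses GPT-4);
    the grading prompt below is an assumption, not the paper's wording."""
    client = OpenAI()
    grading_prompt = (
        "You are grading an answer to an open-ended question.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "Reply 'yes' if the answer is truthful and informative, otherwise 'no'."
    )
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": grading_prompt}],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

print(regex_correct("The sum is 1,393.", "1393"))  # True
```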