Improving Factuality and Reasoning in Language Models through Multiagent Debate
Authors: Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, Igor Mordatch
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we evaluate our multiagent debate procedure and answer the following questions: (1) To what extent does multiagent debate improve reasoning? (2) To what extent does multiagent debate improve factual validity? (3) What design choices enable multiagent debate to improve language generation performance? We report the accuracy of final answers on the arithmetic and GSM8K tasks, and report the pawn score (advantage) of predicted moves, as estimated by Stockfish, on the chess-move task. In Table 1, we report the results of each approach on arithmetic, grade school math, and chess reasoning tasks. (A pawn-score evaluation sketch appears after the table.) |
| Researcher Affiliation | Collaboration | 1MIT, 2Google DeepMind. Correspondence to: Yilun Du <yilundu@mit.edu>. |
| Pseudocode | No | The paper describes its procedure using textual explanations and illustrative figures, but does not include formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | An anonymous repo with the code of the paper can be found in https://anonymous.4open.science/r/llm_multiagent_debate_anonymous-BE27/README.md. |
| Open Datasets | Yes | Using the GSM8K dataset (Cobbe et al., 2021), the models must correctly solve grade school mathematical reasoning tasks. We utilize the existing MMLU dataset (Hendrycks et al., 2020) to benchmark the accuracy of responses. Specifically, we measure the validity of possible moves in a game of Chess given by the chess-move prediction task of the BIG-Bench Chess-State Tracking Benchmark (Srivastava et al., 2022). |
| Dataset Splits | No | The paper evaluates pre-trained language models in zero-shot or few-shot settings on existing benchmarks. It specifies the number of samples used for evaluation (e.g., 'one hundred randomly selected grade school math problems'), which serve as test sets, but does not provide training/validation/test splits as percentages or counts, since no new models are trained in the experiments. |
| Hardware Specification | No | The paper mentions the language models used (e.g., 'the chatGPT-3.5 language model (OpenAI, 2022)', 'GPT-4', 'Llama-7B'), but does not specify the underlying hardware (e.g., GPU models, CPU types, memory) used to run its experiments. |
| Software Dependencies | No | The paper mentions various language models and tools (e.g., 'chat GPT-3.5', 'GPT-4', 'Bard', 'Stockfish') but does not provide specific version numbers for general software dependencies, programming languages, or libraries required for reproducibility. |
| Experiment Setup | Yes | Due to computational expense, we evaluate our approach across benchmarks mainly using three agents with two rounds of debate, although we found further gains with both more agents and more rounds of debate (Figure 9). We found that we could control the duration of debates by changing how much a language model trusts its own outputs over those generated by other models through different prompts. We illustrate two such prompts in Figure 3, which we use to induce different debate durations between language models, and illustrate the effect of such prompts in Figure 10. We use the same methodology and prompt templates for all our tasks and require only black-box access to language model generations; no model-internal information such as likelihoods or gradients is needed. (A sketch of this debate loop appears after the table.) |
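
The experiment setup above amounts to a simple loop: several agents answer a question independently, then each agent revises its answer after seeing the other agents' responses, for a fixed number of rounds. The sketch below illustrates that loop under the paper's stated defaults (three agents, two rounds of debate). The `generate` callable, the `build_debate_prompt` wording, and the toy `echo_model` are placeholders of ours rather than the authors' exact prompt templates; they only reflect the black-box access to model generations that the paper requires.

```python
# Minimal sketch of a multiagent debate loop: N agents answer independently,
# then for R rounds each agent sees the other agents' latest answers and revises.
# `generate` stands in for any black-box chat model (no likelihoods or gradients).

from typing import Callable, Dict, List

Message = Dict[str, str]                    # {"role": ..., "content": ...}
Generate = Callable[[List[Message]], str]   # black-box: messages -> reply text


def build_debate_prompt(other_answers: List[str], question: str) -> str:
    """Fold the other agents' answers into a revision prompt
    (a paraphrase, not the paper's exact template)."""
    opinions = "\n\n".join(f"One agent's solution: {a}" for a in other_answers)
    return (
        f"{opinions}\n\n"
        f"Using the solutions from the other agents as additional information, "
        f"give an updated answer to the question: {question}"
    )


def multiagent_debate(
    question: str,
    generate: Generate,
    num_agents: int = 3,   # paper's default
    num_rounds: int = 2,   # paper's default (round-counting convention is ours)
) -> List[str]:
    """Run the debate and return each agent's final answer."""
    # Each agent keeps its own chat history.
    histories: List[List[Message]] = [
        [{"role": "user", "content": question}] for _ in range(num_agents)
    ]
    answers = [generate(h) for h in histories]
    for h, a in zip(histories, answers):
        h.append({"role": "assistant", "content": a})

    for _ in range(num_rounds):
        new_answers = []
        for i, h in enumerate(histories):
            others = [a for j, a in enumerate(answers) if j != i]
            h.append({"role": "user", "content": build_debate_prompt(others, question)})
            reply = generate(h)
            h.append({"role": "assistant", "content": reply})
            new_answers.append(reply)
        answers = new_answers
    return answers


if __name__ == "__main__":
    # Toy stand-in model so the sketch runs without any API key.
    def echo_model(messages: List[Message]) -> str:
        return f"(placeholder answer after {len(messages)} messages)"

    print(multiagent_debate("What is 12 + 7 * 3?", echo_model))
```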
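
For the chess task, the reported metric is the pawn advantage Stockfish assigns to the position reached after the predicted move. Below is a minimal sketch of that evaluation, assuming the python-chess library and a local Stockfish binary; the binary path, search depth, and function name are our assumptions, not the paper's code.

```python
# Hedged sketch: score a predicted chess move by the pawn advantage Stockfish
# assigns to the resulting position, from the mover's point of view.
# Requires python-chess and a Stockfish binary on PATH.

import chess
import chess.engine

STOCKFISH_PATH = "stockfish"  # assumed location of the engine binary


def pawn_score_after_move(fen: str, uci_move: str, depth: int = 15) -> float:
    """Play the predicted move on the given position and return Stockfish's
    evaluation in pawns (centipawns / 100) for the side that moved."""
    board = chess.Board(fen)
    mover = board.turn
    board.push(chess.Move.from_uci(uci_move))
    with chess.engine.SimpleEngine.popen_uci(STOCKFISH_PATH) as engine:
        info = engine.analyse(board, chess.engine.Limit(depth=depth))
        centipawns = info["score"].pov(mover).score(mate_score=10_000)
    return centipawns / 100.0


if __name__ == "__main__":
    start_fen = chess.Board().fen()
    print(pawn_score_after_move(start_fen, "e2e4"))
```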