Debating with More Persuasive LLMs Leads to More Truthful Answers

Authors: Akbir Khan, John Hughes, Dan Valentine, Laura Ruis, Kshitij Sachan, Ansh Radhakrishnan, Edward Grefenstette, Samuel R. Bowman, Tim Rocktäschel, Ethan Perez

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On the QuALITY comprehension task, we find that debate consistently helps both non-expert models and humans answer questions, achieving 76% and 88% accuracy respectively (naive baselines obtain 48% and 60%).
Researcher Affiliation | Collaboration | ¹University College London, ²Speechmatics, ³MATS, ⁴Anthropic, ⁵FAR AI.
Pseudocode | Yes | Algorithm 1: Best-of-N sampling and critique-and-refinement in the debate protocol. (Hedged sketches of both steps follow this table.)
Open Source Code | Yes | The code we used is available at https://github.com/ucl-dark/llm_debate.
Open Datasets | Yes | Task: We evaluate the ability of non-expert judges to answer questions from the reading comprehension dataset Question Answering with Long Input Texts, Yes! (QuALITY; Pang et al., 2022).
Dataset Splits | Yes | We use two data splits for LLM judge experiments: TL (400 train-set questions) and DL (291 development-set questions). For human experiments, where a story can only appear once, we use TH (153 questions drawn from both sets) and DH (47 questions drawn from both sets).
Hardware Specification | No | No specific hardware (GPU models, CPU models, or cloud instance types) is mentioned for running the experiments; the paper refers only to the use of various large language models.
Software Dependencies | No | The paper lists the large language models used (e.g., GPT-4-Turbo, Claude 2.1, Mixtral 8x7B) with citations, but does not provide the specific software library versions (e.g., Python, PyTorch, or CUDA versions) required for reproduction.
Experiment Setup | Yes | We run protocols for three rounds. To control for the quantity of information presented to the judge across protocols and to mitigate the LLM judge's verbosity bias, we restrict transcripts to 900 words in total, limiting consultants to 300 words per argument and debaters to 150 words. We apply inference-time optimisation using best-of-N (boN) sampling: models are sampled N times, and a preference model selects the most persuasive argument. Table 5 lists the temperature used for each model as a function of best-of-N (boN) or critique-of-N (cN). (See the sketches after this table.)
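
As a concrete illustration of the boN step referenced in the Pseudocode and Experiment Setup rows, here is a minimal Python sketch under the paper's stated setup (N samples per turn, a 150-word cap per debater argument, preference-model selection). The helper names and toy bodies (`call_debater`, `score_persuasiveness`, `truncate_words`) are hypothetical stand-ins, not the repository's actual API.

```python
import random

# Hedged sketch of the best-of-N (boN) step from Algorithm 1: a debater model
# is sampled N times and a preference model keeps the most persuasive
# candidate. `call_debater` and `score_persuasiveness` are hypothetical
# stand-ins for the paper's actual LLM and preference-model API calls.

def truncate_words(text: str, limit: int = 150) -> str:
    # Enforce the paper's 150-word cap on each debater argument.
    return " ".join(text.split()[:limit])

def call_debater(transcript: list[str], temperature: float) -> str:
    # Stand-in for one debater-LLM completion at the given temperature.
    return f"candidate argument {random.randrange(10**6)}"

def score_persuasiveness(argument: str, transcript: list[str]) -> float:
    # Stand-in for a preference-model score; higher means more persuasive.
    return random.random()

def best_of_n(transcript: list[str], n: int = 8, temperature: float = 0.8) -> str:
    """Sample N candidate arguments, truncate each to the word limit, and
    keep the one the preference model rates most persuasive."""
    candidates = [
        truncate_words(call_debater(transcript, temperature)) for _ in range(n)
    ]
    return max(candidates, key=lambda arg: score_persuasiveness(arg, transcript))

if __name__ == "__main__":
    print(best_of_n(["Question: ...", "Debater A: ...", "Debater B: ..."], n=4))
```

In the released code the debaters and the preference model are LLM API calls; swapping the stand-ins for real calls would preserve the sample-then-select logic shown here.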
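Algorithm 1 pairs boN with critique-and-refinement (critique-of-N, cN). Below is a hedged sketch of that second step, assuming the flow is: sample N critiques of a draft argument, let a preference model pick the strongest, then have the debater revise its draft. All helper names (`call_critic`, `score_helpfulness`, `call_refiner`) are hypothetical, not taken from the paper or repository.

```python
import random

# Hedged sketch of critique-of-N (cN) refinement, companion to the boN sketch
# above. All three helpers are hypothetical stand-ins for LLM and
# preference-model calls.

def call_critic(draft: str, temperature: float) -> str:
    # Stand-in for a critic-LLM completion that critiques the draft argument.
    return f"critique {random.randrange(10**6)} of the draft"

def score_helpfulness(critique: str, draft: str) -> float:
    # Stand-in for a preference-model score over candidate critiques.
    return random.random()

def call_refiner(draft: str, critique: str) -> str:
    # Stand-in for the debater revising its draft in light of the critique.
    return f"refined({draft!r}, using {critique!r})"

def critique_and_refine(draft: str, n: int = 4, temperature: float = 0.8) -> str:
    """Sample N critiques, keep the one rated most helpful, and refine the draft."""
    critiques = [call_critic(draft, temperature) for _ in range(n)]
    best = max(critiques, key=lambda c: score_helpfulness(c, draft))
    return call_refiner(draft, best)

if __name__ == "__main__":
    print(critique_and_refine("initial argument", n=2))
```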