Debating with More Persuasive LLMs Leads to More Truthful Answers

Authors: Akbir Khan, John Hughes, Dan Valentine, Laura Ruis, Kshitij Sachan, Ansh Radhakrishnan, Edward Grefenstette, Samuel R. Bowman, Tim Rocktäschel, Ethan Perez

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On the QuALITY comprehension task, we find that debate consistently helps both non-expert models and humans answer questions, achieving 76% and 88% accuracy respectively (naive baselines obtain 48% and 60%).
Researcher Affiliation | Collaboration | ¹University College London, ²Speechmatics, ³MATS, ⁴Anthropic, ⁵FAR AI.
Pseudocode | Yes | Algorithm 1: Best-of-N sampling and critique-and-refinement in the debate protocol. (Hedged sketches of both steps follow this table.)
Open Source Code | Yes | The code we used is available at https://github.com/ucl-dark/llm_debate.
Open Datasets | Yes | Task: We evaluate the ability of non-expert judges to answer questions from the reading comprehension dataset Question Answering with Long Input Texts, Yes! (QuALITY; Pang et al., 2022).
Dataset Splits | Yes | We use two data splits for LLM judge experiments: TL (400 train-set questions) and DL (291 development-set questions). For human experiments, where a story can only appear once, we use TH (153 questions drawn from both sets) and DH (47 questions drawn from both sets).
Hardware Specification | No | No specific hardware (GPU models, CPU models, or cloud instance types) is mentioned for running the experiments; the paper refers only to the use of various large language models.
Software Dependencies | No | The paper lists the large language models used (e.g., GPT-4-Turbo, Claude 2.1, Mixtral 8x7B) with citations, but does not provide the specific software library versions (e.g., Python, PyTorch, or CUDA versions) required for reproduction.
Experiment Setup | Yes | We run protocols for three rounds. To control for the quantity of information presented to the judge across protocols and to mitigate the LLM judge's verbosity bias, we restrict transcripts to 900 words in total, limiting consultants to 300 words per argument and debaters to 150 words. We apply inference-time optimisation using best-of-N (boN) sampling: models are sampled N times, and a preference model selects the most persuasive argument. Table 5 lists the temperature used for each model as a function of best-of-N (boN) or critique-of-N (cN). (See the sketches after this table.)
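
As a concrete illustration of the boN step referenced in the Pseudocode and Experiment Setup rows, here is a minimal Python sketch under the paper's stated setup (N samples per turn, a 150-word cap per debater argument, preference-model selection). The helper names and toy bodies (`call_debater`, `score_persuasiveness`, `truncate_words`) are hypothetical stand-ins, not the repository's actual API.

```python
import random

# Hedged sketch of the best-of-N (boN) step from Algorithm 1: a debater model
# is sampled N times and a preference model keeps the most persuasive
# candidate. `call_debater` and `score_persuasiveness` are hypothetical
# stand-ins for the paper's actual LLM and preference-model API calls.

def truncate_words(text: str, limit: int = 150) -> str:
    # Enforce the paper's 150-word cap on each debater argument.
    return " ".join(text.split()[:limit])

def call_debater(transcript: list[str], temperature: float) -> str:
    # Stand-in for one debater-LLM completion at the given temperature.
    return f"candidate argument {random.randrange(10**6)}"

def score_persuasiveness(argument: str, transcript: list[str]) -> float:
    # Stand-in for a preference-model score; higher means more persuasive.
    return random.random()

def best_of_n(transcript: list[str], n: int = 8, temperature: float = 0.8) -> str:
    """Sample N candidate arguments, truncate each to the word limit, and
    keep the one the preference model rates most persuasive."""
    candidates = [
        truncate_words(call_debater(transcript, temperature)) for _ in range(n)
    ]
    return max(candidates, key=lambda arg: score_persuasiveness(arg, transcript))

if __name__ == "__main__":
    print(best_of_n(["Question: ...", "Debater A: ...", "Debater B: ..."], n=4))
```

In the released code the debaters and the preference model are LLM API calls; swapping the stand-ins for real calls would preserve the sample-then-select logic shown here.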
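Algorithm 1 pairs boN with critique-and-refinement (critique-of-N, cN). Below is a hedged sketch of that second step, assuming the flow is: sample N critiques of a draft argument, let a preference model pick the strongest, then have the debater revise its draft. All helper names (`call_critic`, `score_helpfulness`, `call_refiner`) are hypothetical, not taken from the paper or repository.

```python
import random

# Hedged sketch of critique-of-N (cN) refinement, companion to the boN sketch
# above. All three helpers are hypothetical stand-ins for LLM and
# preference-model calls.

def call_critic(draft: str, temperature: float) -> str:
    # Stand-in for a critic-LLM completion that critiques the draft argument.
    return f"critique {random.randrange(10**6)} of the draft"

def score_helpfulness(critique: str, draft: str) -> float:
    # Stand-in for a preference-model score over candidate critiques.
    return random.random()

def call_refiner(draft: str, critique: str) -> str:
    # Stand-in for the debater revising its draft in light of the critique.
    return f"refined({draft!r}, using {critique!r})"

def critique_and_refine(draft: str, n: int = 4, temperature: float = 0.8) -> str:
    """Sample N critiques, keep the one rated most helpful, and refine the draft."""
    critiques = [call_critic(draft, temperature) for _ in range(n)]
    best = max(critiques, key=lambda c: score_helpfulness(c, draft))
    return call_refiner(draft, best)

if __name__ == "__main__":
    print(critique_and_refine("initial argument", n=2))
```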