Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

AI Debate Aids Assessment of Controversial Claims

Authors: Salman Rahman, Sheriff Issaka, Ashima Suvarna, Genglin Liu, James Shiffer, Jaeyoung Lee, Md Rizwan Parvez, Hamid Palangi, Shi Feng, Nanyun Peng, Yejin Choi, Julian Michael, Liwei Jiang, Saadia Gabriel

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We conduct two studies. Study I recruits human judges with either mainstream or skeptical beliefs who evaluate claims through two protocols: debate (interaction with two AI advisors arguing opposing sides) or consultancy (interaction with a single AI advisor). Study II uses AI judges with and without human-like personas to evaluate the same protocols. In Study I, debate consistently improves human judgment accuracy and confidence calibration, outperforming consultancy by 4-10% across COVID-19 and climate change claims. ... Our LLM judge study (3) tests persona-based judges that emulate specific human demographic profiles and beliefs, comparing their oversight performance against human judges in both personalized and non-personalized settings.
Researcher Affiliation Collaboration (1) University of California, Los Angeles; (2) Seoul National University; (3) Qatar Computing Research Institute; (4) Google; (5) George Washington University; (6) Stanford University; (7) Scale AI; (8) University of Washington
Pseudocode No The paper describes the experimental protocols and LLM prompts in detail within the appendices, structured as text. However, it does not contain formal pseudocode blocks or algorithms in a traditional sense for its methodology.
Open Source Code Yes Code & Data: https://github.com/salman-lui/ai-debate We will open-source all the code and artifacts upon publication.
Open Datasets Yes Ultimately we identified COVID-19 factuality claims as a promising test domain, and collected claims from Check COVID [53, 55], a carefully constructed dataset where labels are established through expert verification and documentation against scientific journal articles from the CORD-19 dataset [56]. ... To validate generalization, we conducted an additional human study using climate change claims from Climate-Fever [17, 19], curated following the same criteria.
Dataset Splits No We collected claims from Check COVID [53, 55]... resulting in 121 COVID-19-related claims. To validate generalization, we conducted an additional human study using climate change claims from Climate-Fever [17, 19], curated following the same criteria... resulting in a dataset used for both human judge studies (146 participants across 845 sessions) and LLM judge experiments (184 claims total).
Hardware Specification No We implemented two AI intervention protocols using GPT-4o (temperature t = 0.2) to evaluate factuality claims. ... Our LLM judge study (3) tests persona-based judges that emulate specific human demographic profiles and beliefs... We also evaluate standard LLM judges (GPT-4o and Qwen-2.5-7B) without persona conditioning. ... We are grateful to Open AI and QCRI for providing API credits that supported our experiments.
Software Dependencies No We implemented two AI intervention protocols using GPT-4o (temperature t = 0.2) to evaluate factuality claims. ... We use LLMs (GPT-4o, Gemini-2.0-Flash) as evaluators to rate the prevalence of each strategy on an ordinal scale that ranges from none to high [16].
Experiment Setup Yes We implemented two AI intervention protocols using GPT-4o (temperature t = 0.2) to evaluate factuality claims. ... Debate: In this protocol, two GPT-4o debaters simultaneously argue for opposing positions (true vs. false) in an adversarial format. The interaction follows a strict turn-taking structure (debater A, debater B, judge) for three complete rounds without interruptions. ... Consultancy (Baseline): This protocol features a single GPT-4o consultant arguing for an assigned position (true or false) in a non-adversarial setting. The three structured rounds consist of: (1) a consultant presents initial arguments and the judge raises questions, (2) the consultant responds to the judge's questions, and (3) the consultant provides final evidence. ... Persona-based LLM judges: To emulate human judgment realistically, we assign each LLM judge (GPT-4o) a persona matching the demographic attributes and COVID-19 beliefs from our human study (Section 2.1.3). These personas incorporate demographic factors (age, gender, education, location type, political stance) and COVID-19 beliefs (origin, vaccine efficacy, mask effectiveness).
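
The turn structure quoted in the Experiment Setup row can be sketched as a small schedule builder. This is an illustrative sketch only, not the authors' implementation: the model name, temperature, and round count come from the quoted text, while the `Turn` class, the role labels, the action strings, and both function names are hypothetical. The quoted text does not say what the judge does in consultancy rounds 2 and 3, so the sketch lists only the turns it explicitly names.

```python
from dataclasses import dataclass
from typing import List

# Values stated in the quoted setup: GPT-4o advisors, temperature 0.2, three rounds.
MODEL = "gpt-4o"
TEMPERATURE = 0.2
ROUNDS = 3


@dataclass(frozen=True)
class Turn:
    """One turn in a session (hypothetical container, not from the paper)."""
    round_no: int   # 1-based round index
    speaker: str    # hypothetical role label
    action: str     # short description of the turn


def debate_schedule(rounds: int = ROUNDS) -> List[Turn]:
    """Strict turn-taking: debater A, debater B, judge, repeated every round."""
    order = [
        ("debater_A", "argue one side of the claim"),
        ("debater_B", "argue the opposing side"),
        ("judge", "judge turn"),
    ]
    return [
        Turn(r, speaker, action)
        for r in range(1, rounds + 1)
        for speaker, action in order
    ]


def consultancy_schedule() -> List[Turn]:
    """Only the turns explicitly named in the quoted three-round description."""
    return [
        Turn(1, "consultant", "present initial arguments"),
        Turn(1, "judge", "raise questions"),
        Turn(2, "consultant", "respond to the judge's questions"),
        Turn(3, "consultant", "provide final evidence"),
    ]


if __name__ == "__main__":
    for turn in debate_schedule():
        print(turn.round_no, turn.speaker, turn.action)
```

Keeping the schedule as plain data, separate from any model call, makes the two protocols easy to compare side by side: debate yields nine turns (three speakers over three rounds), while the consultancy list contains the four turns named in the description.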