Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

BiDeV: Bilateral Defusing Verification for Complex Claim Fact-Checking

Authors: Yuxuan Liu, Hongda Sun, Wenya Guo, Xinyan Xiao, Cunli Mao, Zhengtao Yu, Rui Yan

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results on two widely used challenging fact-checking benchmarks (Hover and Feverous-s) demonstrate that our BiDeV can achieve the best performance under both gold and open settings.
Researcher Affiliation | Collaboration | 1 Gaoling School of Artificial Intelligence, Renmin University of China; 2 Nankai University; 3 Baidu Inc.; 4 Kunming University of Science and Technology. EMAIL, EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | No | The overview of our BiDeV is shown in Figure 2. In the subsequent sections, we will introduce how to integrate LLMs to eliminate the vagueness in the claim and the redundancy in the evidence. (Figure 2 is a diagram, not pseudocode.) Figure 7: Case Study of selected baselines (FOLK and ProgramFC) and our BiDeV. (The pseudocode-like structures in Figure 7 are for baselines, not BiDeV's core algorithm.)
Open Source Code | Yes | Code https://github.com/EthanLeo-LYX/BiDeV
Open Datasets | Yes | Datasets. There are two widely used and challenging datasets to evaluate the fact-checking performance of baselines and our BiDeV: (i) Hover (Jiang et al. 2020) and (ii) Feverous-s (Pan et al. 2023).
Dataset Splits | Yes | Datasets. There are two widely used and challenging datasets to evaluate the fact-checking performance of baselines and our BiDeV: (i) Hover (Jiang et al. 2020) and (ii) Feverous-s (Pan et al. 2023).
Hardware Specification | No | In our proposed method, we use gpt-3.5-turbo as the base model of Perceptor, Rewriter, Decomposer, and Filter by accessing to OpenAI API with few-shot demonstrations. For a fair comparison, we leverage Flan-T5-XL (3B) as the Querier and Checker without additional fine-tuning. (The paper does not provide specific hardware details such as GPU/CPU models.)
Software Dependencies | No | In our proposed method, we use gpt-3.5-turbo as the base model of Perceptor, Rewriter, Decomposer, and Filter by accessing to OpenAI API with few-shot demonstrations. For a fair comparison, we leverage Flan-T5-XL (3B) as the Querier and Checker without additional fine-tuning. (The paper does not specify software versions for reproducibility.)
Experiment Setup | Yes | In the vagueness defusing, we iteratively perceive-then-rewrite for 3 rounds.
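
Illustrative sketch: The Experiment Setup entry mentions a perceive-then-rewrite loop run for 3 rounds. Since the Pseudocode entry reports that the paper gives none, the Python below is only a rough reading of that one sentence, not the authors' implementation. The function name `defuse_vagueness`, the `perceive` and `rewrite` callables, the early stop when nothing vague is found, and the toy stand-ins are all assumptions introduced here; in the paper these roles are filled by prompted LLMs.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces (assumptions, not taken from the paper's code).
Perceptor = Callable[[str], List[str]]      # claim -> vague phrases found in it
Rewriter = Callable[[str, List[str]], str]  # claim + vague phrases -> clearer claim


def defuse_vagueness(claim: str, perceive: Perceptor, rewrite: Rewriter,
                     rounds: int = 3) -> str:
    """Run perceive-then-rewrite for at most `rounds` rounds."""
    for _ in range(rounds):
        vague = perceive(claim)
        if not vague:
            # Assumption: stop early once the perceptor finds nothing vague.
            break
        claim = rewrite(claim, vague)
    return claim


# Toy stand-ins so the sketch runs without any model or API access.
TOY_REFERENTS: Dict[str, str] = {"this author": "Author A", "that book": "Book B"}


def toy_perceive(claim: str) -> List[str]:
    # Report every placeholder phrase that still appears in the claim.
    return [phrase for phrase in TOY_REFERENTS if phrase in claim]


def toy_rewrite(claim: str, vague: List[str]) -> str:
    # Resolve only the first vague phrase per round, so several rounds are needed.
    first = vague[0]
    return claim.replace(first, TOY_REFERENTS[first])


if __name__ == "__main__":
    print(defuse_vagueness("this author wrote that book", toy_perceive, toy_rewrite))
```

With the toy stand-ins, each round resolves one placeholder phrase, so the two-phrase example claim is fully rewritten after two rounds and the third round ends the loop.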