Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
BiDeV: Bilateral Defusing Verification for Complex Claim Fact-Checking
Authors: Yuxuan Liu, Hongda Sun, Wenya Guo, Xinyan Xiao, Cunli Mao, Zhengtao Yu, Rui Yan
AAAI 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results on two widely used challenging fact-checking benchmarks (Hover and Feverous-s) demonstrate that our BiDeV can achieve the best performance under both gold and open settings. |
| Researcher Affiliation | Collaboration | 1 Gaoling School of Artificial Intelligence, Renmin University of China; 2 Nankai University; 3 Baidu Inc.; 4 Kunming University of Science and Technology |
| Pseudocode | No | The overview of our BiDeV is shown in Figure 2. In the subsequent sections, we will introduce how to integrate LLMs to eliminate the vagueness in the claim and the redundancy in the evidence. (Figure 2 is a diagram, not pseudocode.) Figure 7: Case Study of selected baselines (FOLK and ProgramFC) and our BiDeV. (The pseudocode-like structures in Figure 7 are for baselines, not BiDeV's core algorithm.) |
| Open Source Code | Yes | Code https://github.com/EthanLeo-LYX/BiDeV |
| Open Datasets | Yes | Datasets. There are two widely used and challenging datasets to evaluate the fact-checking performance of baselines and our BiDeV: (i) Hover (Jiang et al. 2020) and (ii) Feverous-s (Pan et al. 2023). |
| Dataset Splits | Yes | Datasets. There are two widely used and challenging datasets to evaluate the fact-checking performance of baselines and our BiDeV: (i) Hover (Jiang et al. 2020) and (ii) Feverous-s (Pan et al. 2023). |
| Hardware Specification | No | In our proposed method, we use gpt-3.5-turbo as the base model of Perceptor, Rewriter, Decomposer, and Filter by accessing to OpenAI API with few-shot demonstrations. For a fair comparison, we leverage Flan-T5-XL (3B) as the Querier and Checker without additional fine-tuning. The paper does not provide specific hardware details like GPU/CPU models. |
| Software Dependencies | No | In our proposed method, we use gpt-3.5-turbo as the base model of Perceptor, Rewriter, Decomposer, and Filter by accessing to OpenAI API with few-shot demonstrations. For a fair comparison, we leverage Flan-T5-XL (3B) as the Querier and Checker without additional fine-tuning. The paper does not specify software versions for reproducibility. |
| Experiment Setup | Yes | In the vagueness defusing, we iteratively perceive-then-rewrite for 3 rounds. |
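
The Experiment Setup row describes a fixed-length perceive-then-rewrite loop. A minimal Python sketch of that control flow follows, assuming two hypothetical callables, `perceive` and `rewrite`, that stand in for the LLM-backed Perceptor and Rewriter. Their names, signatures, and the toy stand-ins are illustrative assumptions, not the authors' implementation.

```python
from typing import Callable, List


def defuse_vagueness(
    claim: str,
    perceive: Callable[[str], List[str]],
    rewrite: Callable[[str, List[str]], str],
    rounds: int = 3,
) -> str:
    """Alternate a perceive step and a rewrite step for a fixed number of rounds.

    `perceive` and `rewrite` are hypothetical placeholders for model calls;
    only the round count (3) comes from the reported setup.
    """
    current = claim
    for _ in range(rounds):
        # Hypothetical: collect notes about what is still vague in the claim.
        notes = perceive(current)
        # Hypothetical: produce a revised claim conditioned on those notes.
        current = rewrite(current, notes)
    return current


# Toy stand-ins so the loop can be run without any model access.
def toy_perceive(claim: str) -> List[str]:
    return ["placeholder note"]


def toy_rewrite(claim: str, notes: List[str]) -> str:
    # Append one marker per round to make the iteration count visible.
    return claim + "*"


result = defuse_vagueness("claim", toy_perceive, toy_rewrite)
```

With the toy stand-ins, each round appends one marker, so the returned string shows how many perceive-then-rewrite rounds were executed.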