Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

BiDeV: Bilateral Defusing Verification for Complex Claim Fact-Checking

Authors: Yuxuan Liu, Hongda Sun, Wenya Guo, Xinyan Xiao, Cunli Mao, Zhengtao Yu, Rui Yan

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experimental results on two widely used challenging fact-checking benchmarks (Hover and Feverous-s) demonstrate that our BiDeV can achieve the best performance under both gold and open settings.
Researcher Affiliation	Collaboration	1 Gaoling School of Artificial Intelligence, Renmin University of China 2 Nankai University 3 Baidu Inc. 4 Kunming University of Science and Technology EMAIL, EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode	No	The overview of our BiDeV is shown in Figure 2. In the subsequent sections, we will introduce how to integrate LLMs to eliminate the vagueness in the claim and the redundancy in the evidence. (Figure 2 is a diagram, not pseudocode.) Figure 7: Case Study of selected baselines (FOLK and ProgramFC) and our BiDeV. (The pseudocode-like structures in Figure 7 are for baselines, not BiDeV's core algorithm.)
Open Source Code	Yes	Code: https://github.com/EthanLeo-LYX/BiDeV
Open Datasets	Yes	Datasets. There are two widely used and challenging datasets to evaluate the fact-checking performance of baselines and our BiDeV: (i) Hover (Jiang et al. 2020) and (ii) Feverous-s (Pan et al. 2023).
Dataset Splits	Yes	Datasets. There are two widely used and challenging datasets to evaluate the fact-checking performance of baselines and our BiDeV: (i) Hover (Jiang et al. 2020) and (ii) Feverous-s (Pan et al. 2023).
Hardware Specification	No	In our proposed method, we use gpt-3.5-turbo as the base model of Perceptor, Rewriter, Decomposer, and Filter by accessing to OpenAI API with few-shot demonstrations. For a fair comparison, we leverage Flan-T5-XL (3B) as the Querier and Checker without additional fine-tuning. (The paper does not provide specific hardware details such as GPU/CPU models.)
Software Dependencies	No	In our proposed method, we use gpt-3.5-turbo as the base model of Perceptor, Rewriter, Decomposer, and Filter by accessing to OpenAI API with few-shot demonstrations. For a fair comparison, we leverage Flan-T5-XL (3B) as the Querier and Checker without additional fine-tuning. (The paper does not specify software versions for reproducibility.)
Experiment Setup	Yes	In the vagueness defusing, we iteratively perceive-then-rewrite for 3 rounds.