Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Enhancing Uncertainty Modeling with Semantic Graph for Hallucination Detection

Authors: Kedi Chen, Qin Chen, Jie Zhou, Xinqi Tao, Bowen Ding, Jingwen Xie, Mingchen Xie, Peilong Li, Zheng Feng

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on two datasets show the great advantages of our proposed approach. In particular, we obtain substantial improvements with 19.78% in passage-level hallucination detection. We perform experiments on two datasets, namely the well-known WikiBio (Manakul, Liusie, and Gales 2023) and our constructed NoteSum. The results show the great superiority of our approach in both sentence-level and passage-level hallucination detection. We conduct elaborate analyses of the experimental results on two benchmark datasets, and provide a better understanding of the effectiveness of our approach. Ablation Studies: We conduct ablation studies on WikiBio with LLaMA-30B from three dimensions: token, sentence, and passage. Experimental results are shown in Table 3.
Researcher Affiliation	Collaboration	Kedi Chen1*, Qin Chen1*, Jie Zhou1, Xinqi Tao2, Bowen Ding2, Jingwen Xie2, Mingchen Xie2, Peilong Li2, Feng Zheng2; 1East China Normal University, 2Xiaohongshu Inc.
Pseudocode	No	The paper describes the methodology in narrative text and uses formulas (e.g., U(tj i), Io, UE(i), UG(i), Us(i), Up) and an overall framework diagram (Figure 2). It does not contain a formally structured pseudocode or algorithm block.
Open Source Code	No	The paper does not contain an explicit statement about releasing code or a link to a code repository.
Open Datasets	Yes	We conduct extensive experiments on two datasets for hallucination detection. One is currently the latest and most widely used dataset WikiBio. WikiBio (Manakul, Liusie, and Gales 2023) is a dataset derived from Wikipedia biographies.
Dataset Splits	No	The paper describes the annotation of sentences within the datasets (Factual, NonFact*, NonFact) and provides statistics (Table 1) but does not specify how the datasets were split into training, validation, or test sets for the experiments.
Hardware Specification	No	The paper mentions using specific models like LLaMA-13B and LLaMA-30B and a DeBERTa-v3-Large NLI model, but it does not provide any specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies	No	The paper mentions several software tools and models, including a 'transition-based AMR parser (Xu, Lee, and Huang 2023)', 'spaCy' for coreference resolution and entity linking, 'DeBERTa-v3-Large (He, Gao, and Chen 2023) NLI model', and 'LLaMA-13B and LLaMA-30B models'. However, it does not specify version numbers for general software dependencies like Python, specific library versions, or the exact version of spaCy used.
Experiment Setup	Yes	The hyper-parameters α, β, λ, and k are set to 0.8, 0.65, 0.7, and 3 respectively.
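
For anyone attempting a reproduction, the reported hyper-parameter values could be pinned in a small configuration object, as in the minimal sketch below. The class name and field names are hypothetical, and this summary does not say what role α, β, λ, and k play in the method, so the sketch records only the values and deliberately attaches no semantics to them.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ReportedHyperParams:
    """Values quoted in the Experiment Setup row.

    Field names are illustrative assumptions; the meaning of each
    symbol must be taken from the paper itself.
    """
    alpha: float = 0.8   # α as reported
    beta: float = 0.65   # β as reported
    lam: float = 0.7     # λ as reported ("lambda" is a Python keyword)
    k: int = 3           # k as reported


config = ReportedHyperParams()
config_dict = asdict(config)  # plain dict, convenient for logging with each run
```

Freezing the dataclass keeps the values from being changed by accident during a run, which makes it easier to confirm that results were produced with the reported settings.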