Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Towards Trustworthy Knowledge Graph Reasoning: An Uncertainty Aware Perspective
Authors: Bo Ni, Yu Wang, Lu Cheng, Erik Blasch, Tyler Derr
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that UAG can achieve any pre-defined coverage rate while reducing the prediction set/interval size by 40% on average over the baselines. In this section, we conduct extensive experiments on two widely used multi-hop KGQA datasets (Luo et al. 2024; Sun et al. 2023; Lan et al. 2022). Our experiments are designed to rigorously evaluate the uncertainty quantification performance of UAG, employing standard UQ metrics to assess its effectiveness against state-of-the-art baselines. |
| Researcher Affiliation | Academia | Bo Ni (Vanderbilt University), Yu Wang (University of Oregon), Lu Cheng (University of Illinois Chicago), Erik Blasch (Air Force Research Lab), Tyler Derr (Vanderbilt University) |
| Pseudocode | No | The paper describes the UAG framework and its components using descriptive text and mathematical equations (e.g., Eq. 4, 5, 6) but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Other implementation details are further documented in the appendix and our code is publicly available: https://github.com/Arstanley/UAG |
| Open Datasets | Yes | We evaluate UAG with two widely used benchmark datasets for KGQA: WebQuestionsSP (WebQSP) (Yih et al. 2016) and ComplexWebQuestions (CWQ) (Talmor and Berant 2018). ... Additionally, Freebase (Bollacker et al. 2008) is used as the underlying knowledge graph for both datasets. |
| Dataset Splits | Yes | The dataset statistics are presented in Table 1. Note that for calibration, we use the training partition. Table 1: Statistics of KGQA datasets. WebQSP: 2,826 train / 1,628 test; CWQ: 27,639 train / 3,531 test. |
| Hardware Specification | No | The paper mentions using 'Llama3-8B as our backbone large language model' and a 'Sentence Transformer model' but does not specify any hardware details (e.g., GPU/CPU models, memory) used for training or inference. |
| Software Dependencies | No | The paper mentions using 'Llama3-8B as our backbone large language model' and a 'Sentence Transformer model' but does not provide specific version numbers for any software libraries, frameworks, or programming languages used in the implementation. |
| Experiment Setup | Yes | For the implementation, we set δ to be 0.05, and use Llama3-8B as our backbone large language model. For the encoder g, we use the Sentence Transformer model and pre-train it on our training data. We select α = 0.2 for the purpose of testing. |
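The coverage guarantee quoted above (any pre-defined coverage rate, with the training partition reused for calibration) is characteristic of split conformal prediction. The following is a minimal, hedged sketch of that general technique, not the authors' UAG implementation: it computes the finite-sample-corrected quantile of calibration nonconformity scores at the paper's reported α = 0.2 and uses it to build a prediction set. The score values and function names are illustrative assumptions.

```python
import numpy as np

def conformal_quantile(cal_scores, alpha):
    # Finite-sample corrected (1 - alpha) quantile of the calibration
    # nonconformity scores; guarantees >= (1 - alpha) marginal coverage.
    n = len(cal_scores)
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(cal_scores, min(q_level, 1.0), method="higher")

def prediction_set(candidate_scores, qhat):
    # Keep every candidate answer whose nonconformity score
    # does not exceed the calibrated threshold.
    return [i for i, s in enumerate(candidate_scores) if s <= qhat]

# Toy calibration split (stand-in for scores derived from the KGQA data).
cal_scores = np.arange(1, 101) / 100
qhat = conformal_quantile(cal_scores, alpha=0.2)  # alpha from the paper's setup
print(prediction_set([0.1, 0.9, 0.5], qhat))
```

On the toy scores the threshold lands near the empirical 80th percentile, so candidates scoring below it form the prediction set; shrinking the average size of this set while keeping the coverage guarantee is the improvement the paper reports.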