Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Towards Trustworthy Knowledge Graph Reasoning: An Uncertainty Aware Perspective

Authors: Bo Ni, Yu Wang, Lu Cheng, Erik Blasch, Tyler Derr

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments show that UAG can achieve any pre-defined coverage rate while reducing the prediction set/interval size by 40% on average over the baselines. In this section, we conduct extensive experiments on two widely used multi-hop KGQA datasets (Luo et al. 2024; Sun et al. 2023; Lan et al. 2022). Our experiments are designed to rigorously evaluate the uncertainty quantification performance of UAG, employing standard UQ metrics to assess its effectiveness against state-of-the-art baselines.
Researcher Affiliation Academia Bo Ni1, Yu Wang2, Lu Cheng3, Erik Blasch4, Tyler Derr1 1Vanderbilt University 2University of Oregon 3University of Illinois Chicago 4Air Force Research Lab
Pseudocode No The paper describes the UAG framework and its components using descriptive text and mathematical equations (e.g., Eq. 4, 5, 6) but does not present any structured pseudocode or algorithm blocks.
Open Source Code Yes Other implementation details are further documented in the appendix and our code is publicly available at https://github.com/Arstanley/UAG
Open Datasets Yes We evaluate UAG with two widely used benchmark datasets for KGQA: WebQuestionsSP (WebQSP) (Yih et al. 2016) and Complex Web Questions (CWQ) (Talmor and Berant 2018). ... Additionally, Freebase (Bollacker et al. 2008) is used as the underlying knowledge graph for both datasets.
Dataset Splits Yes The dataset statistics are presented in Table 1. Note that for calibration, we use the training partition. Table 1: Statistics of KGQA datasets — WebQSP: 2,826 train / 1,628 test; CWQ: 27,639 train / 3,531 test.
Hardware Specification No The paper mentions using 'Llama3-8b as our backbone large language model' and a 'Sentence Transformer model' but does not specify any hardware details (e.g., GPU/CPU models, memory) used for training or inference.
Software Dependencies No The paper mentions using 'Llama3-8b as our backbone large language model' and a 'Sentence Transformer model' but does not provide specific version numbers for any software libraries, frameworks, or programming languages used in the implementation.
Experiment Setup Yes For the implementation, we set δ to be 0.05, and use Llama3-8b as our backbone large language model. For the encoder g, we use the Sentence Transformer model and pre-train it on our training data. We select α = 0.2 for the purpose of testing.
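For context on the setup above: UAG's coverage guarantee comes from conformal prediction, where α (here 0.2) is the target error rate and prediction sets are calibrated on held-out data. The sketch below is an illustrative split-conformal routine, not the authors' implementation; the function name, the softmax-score nonconformity measure, and the data shapes are assumptions for the example.

```python
import numpy as np

def conformal_prediction_sets(cal_scores, cal_labels, test_scores, alpha=0.2):
    """Split conformal prediction: build answer sets with ~(1 - alpha) coverage.

    cal_scores:  (n_cal, n_candidates) model scores on the calibration split
    cal_labels:  (n_cal,) indices of the true answers
    test_scores: (n_test, n_candidates) model scores on the test split
    """
    n = len(cal_labels)
    # Nonconformity score: 1 minus the score assigned to the true answer.
    nonconf = 1.0 - cal_scores[np.arange(n), cal_labels]
    # Conformal quantile with the standard finite-sample correction.
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    qhat = np.quantile(nonconf, q_level, method="higher")
    # A candidate enters the set if its nonconformity is within the threshold.
    return [np.where(1.0 - s <= qhat)[0] for s in test_scores]
```

With a well-calibrated model, the returned sets cover the true answer on at least a (1 - α) fraction of test questions; the paper's reported metric (average set/interval size) measures how tight those sets are at the target coverage.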