Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Towards Trustworthy Knowledge Graph Reasoning: An Uncertainty Aware Perspective
Authors: Bo Ni, Yu Wang, Lu Cheng, Erik Blasch, Tyler Derr
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that UAG can achieve any pre-defined coverage rate while reducing the prediction set/interval size by 40% on average over the baselines. In this section, we conduct extensive experiments on two widely used multi-hop KGQA datasets (Luo et al. 2024; Sun et al. 2023; Lan et al. 2022). Our experiments are designed to rigorously evaluate the uncertainty quantification performance of UAG, employing standard UQ metrics to assess its effectiveness against state-of-the-art baselines. |
| Researcher Affiliation | Academia | Bo Ni (Vanderbilt University), Yu Wang (University of Oregon), Lu Cheng (University of Illinois Chicago), Erik Blasch (Air Force Research Lab), Tyler Derr (Vanderbilt University) |
| Pseudocode | No | The paper describes the UAG framework and its components using descriptive text and mathematical equations (e.g., Eq. 4, 5, 6) but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Other implementation details are further documented in the appendix and our code is publicly available: https://github.com/Arstanley/UAG |
| Open Datasets | Yes | We evaluate UAG with two widely used benchmark datasets for KGQA: WebQuestionsSP (WebQSP) (Yih et al. 2016) and ComplexWebQuestions (CWQ) (Talmor and Berant 2018). ... Additionally, Freebase (Bollacker et al. 2008) is used as the underlying knowledge graph for both datasets. |
| Dataset Splits | Yes | The dataset statistics are presented in Table 1. Note that for calibration, we use the training partition. Table 1: Statistics of KGQA datasets. WebQSP: 2,826 train / 1,628 test; CWQ: 27,639 train / 3,531 test. |
| Hardware Specification | No | The paper mentions using 'Llama3-8B as our backbone large language model' and a 'Sentence Transformer model' but does not specify any hardware details (e.g., GPU/CPU models, memory) used for training or inference. |
| Software Dependencies | No | The paper mentions using 'Llama3-8B as our backbone large language model' and a 'Sentence Transformer model' but does not provide specific version numbers for any software libraries, frameworks, or programming languages used in the implementation. |
| Experiment Setup | Yes | For the implementation, we set δ to be 0.05, and use Llama3-8B as our backbone large language model. For the encoder g, we use the Sentence Transformer model and pre-train it on our training data. We select α = 0.2 for the purpose of testing. |
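The coverage guarantee quoted above (any pre-defined coverage rate, with the training partition reused for calibration) is characteristic of split conformal prediction. The following is a minimal, hedged sketch of that general technique, not the authors' UAG implementation: it computes the finite-sample-corrected quantile of calibration nonconformity scores at the paper's reported α = 0.2 and uses it to build a prediction set. The score values and function names are illustrative assumptions.

```python
import numpy as np

def conformal_quantile(cal_scores, alpha):
    # Finite-sample corrected (1 - alpha) quantile of the calibration
    # nonconformity scores; guarantees >= (1 - alpha) marginal coverage.
    n = len(cal_scores)
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(cal_scores, min(q_level, 1.0), method="higher")

def prediction_set(candidate_scores, qhat):
    # Keep every candidate answer whose nonconformity score
    # does not exceed the calibrated threshold.
    return [i for i, s in enumerate(candidate_scores) if s <= qhat]

# Toy calibration split (stand-in for scores derived from the KGQA data).
cal_scores = np.arange(1, 101) / 100
qhat = conformal_quantile(cal_scores, alpha=0.2)  # alpha from the paper's setup
print(prediction_set([0.1, 0.9, 0.5], qhat))
```

On the toy scores the threshold lands near the empirical 80th percentile, so candidates scoring below it form the prediction set; shrinking the average size of this set while keeping the coverage guarantee is the improvement the paper reports.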