Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Knowledge Graph Prompting for Multi-Document Question Answering

Authors: Yu Wang, Nedim Lipka, Ryan A. Rossi, Alexa Siu, Ruiyi Zhang, Tyler Derr

AAAI 2024 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments underscore the efficacy of KGP for MD-QA, signifying the potential of leveraging graphs in enhancing the prompt design and retrieval augmented generation for LLMs. Our code: https://github.com/YuWVandy/KG-LLM-MDQA. ... We compare the MD-QA performance of the proposed KGP-T5 and other baselines in Table 1.
Researcher Affiliation	Collaboration	Yu Wang,1 Nedim Lipka,2 Ryan A. Rossi,2 Alexa Siu,2 Ruiyi Zhang,2 Tyler Derr1 1 Vanderbilt University, Nashville, USA 2 Adobe Research, San Jose, USA
Pseudocode	Yes	Algorithm 1: LLM-based KG Traversal Algorithm to Retrieve Relevant Context for Content-based Question.
Open Source Code	Yes	Our code: https://github.com/YuWVandy/KG-LLM-MDQA.
Open Datasets	Yes	we randomly sample multi-document questions from the development set of 2WikiMQA (Ho et al. 2020) and MuSiQue (Trivedi et al. 2022b)... We randomly sample questions from HotpotQA and construct KGs over the set of documents for each of these questions using our proposed methods.
Dataset Splits	No	The paper mentions sampling from the 'development set' for some datasets and using 'HotpotQA', but does not explicitly provide the specific train/validation/test splits (e.g., percentages or sample counts) used in its experiments.
Hardware Specification	No	No specific hardware details (e.g., GPU/CPU models, memory) used for running experiments are mentioned in the paper.
Software Dependencies	No	The paper mentions various software components and models (e.g., T5, LLaMA, RoBERTa-base, TF-IDF, Extract-PDF API) but does not provide specific version numbers for these dependencies.
Experiment Setup	No	The paper states, 'Detailed experimental setting is presented in Section 13. Due to the space limitation, we comprehensively introduce our experimental setting, including dataset collection, baselines, and evaluation criteria, in Supplementary 8.1-8.2.' These details are deferred to an external supplementary document not provided in the main paper.