Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
KGARevion: An AI Agent for Knowledge-Intensive Biomedical QA
Authors: Xiaorui Su, Yibo Wang, Shanghua Gao, Xiaolong Liu, Valentina Giunchiglia, Djork-Arné Clevert, Marinka Zitnik
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluations on medical QA benchmarks show that KGAREVION improves accuracy by over 5.2% over 15 models in handling complex medical queries. To further assess its effectiveness, we curated three new medical QA datasets with varying levels of semantic complexity, where KGAREVION improved accuracy by 10.4%. |
| Researcher Affiliation | Collaboration | Xiaorui Su1 Yibo Wang2 Shanghua Gao1 Xiaolong Liu2 Valentina Giunchiglia3 Djork-Arné Clevert4 Marinka Zitnik1 1Harvard University 2University of Illinois Chicago 3Imperial College London 4Pfizer |
| Pseudocode | No | The paper describes the KGAREVION agent's actions (Generate, Review, Revise, Answer) in narrative text and figures, but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | KGAREVION is available at https://github.com/mims-harvard/KGARevion. |
| Open Datasets | Yes | We first start with four multi-choice medical QA benchmarks (Xiong et al., 2024a) (Table 1). In addition, we introduce a new benchmark for multi-choice complex medical QA focused on differential diagnosis (DDx), named MedDDx... AfriMed-QA, a newly published QA dataset released after all baseline models in this study (Olatunji et al., 2024). |
| Dataset Splits | Yes | During the fine-tuning stage, we first split PrimeKG (Chandak et al., 2023) into two parts: a training set and a testing set, in a ratio of 8:2. |
| Hardware Specification | Yes | All experiments are conducted on a machine equipped with 4 NVIDIA H100. We use 1 NVIDIA H100 to implement baselines with small LLMs. In the fine-tuning stage, we use 4 NVIDIA H100 to fine-tune the review module. |
| Software Dependencies | Yes | We implement KGAREVION using Python 3.9.19, PyTorch 2.3.1, Transformers 4.43.1, and Tokenizers 0.19.1. |
| Experiment Setup | Yes | For hyperparameter tuning, we use grid search to identify the optimal parameter combinations by evaluating the fine-tuned model's performance on the knowledge graph completion task using the testing set. Specifically, we focus on the parameter r in LoRA training and the batch size during the fine-tuning stage. The values explored for r are 16, 32, 64, 128, while the tested batch sizes bz are 128, 256, 512, 1024. The best parameters identified are r = 32, bz = 256. |
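The grid search described in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' code: `evaluate` is a hypothetical callback standing in for fine-tuning the review module with a given LoRA rank `r` and batch size `bz` and scoring it on the KG-completion test split; the grids are the values reported in the paper.

```python
from itertools import product

# Hyperparameter grids reported in the paper: LoRA rank r and batch size bz.
R_VALUES = [16, 32, 64, 128]
BZ_VALUES = [128, 256, 512, 1024]

def grid_search(evaluate):
    """Return the (r, bz) pair that maximizes `evaluate`.

    `evaluate(r, bz)` is assumed to fine-tune the model with the given
    settings and return a validation score (higher is better); here it
    is just a placeholder for the real fine-tune-and-score loop.
    """
    best, best_score = None, float("-inf")
    for r, bz in product(R_VALUES, BZ_VALUES):
        score = evaluate(r, bz)
        if score > best_score:
            best, best_score = (r, bz), score
    return best

if __name__ == "__main__":
    # Toy scoring function peaking at the paper's reported optimum
    # (r = 32, bz = 256), purely to exercise the search loop.
    toy_score = lambda r, bz: -abs(r - 32) - abs(bz - 256) / 10
    print(grid_search(toy_score))  # -> (32, 256)
```

In practice, each `evaluate` call is expensive (a full fine-tuning run on 4 NVIDIA H100s per the Hardware Specification row), so the 4x4 grid implies 16 such runs.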