Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
KGARevion: An AI Agent for Knowledge-Intensive Biomedical QA
Authors: Xiaorui Su, Yibo Wang, Shanghua Gao, Xiaolong Liu, Valentina Giunchiglia, Djork-Arné Clevert, Marinka Zitnik
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluations on medical QA benchmarks show that KGAREVION improves accuracy by over 5.2% over 15 models in handling complex medical queries. To further assess its effectiveness, we curated three new medical QA datasets with varying levels of semantic complexity, where KGAREVION improved accuracy by 10.4%. |
| Researcher Affiliation | Collaboration | Xiaorui Su1 Yibo Wang2 Shanghua Gao1 Xiaolong Liu2 Valentina Giunchiglia3 Djork-Arné Clevert4 Marinka Zitnik1 1Harvard University 2University of Illinois Chicago 3Imperial College London 4Pfizer |
| Pseudocode | No | The paper describes the KGAREVION agent's actions (Generate, Review, Revise, Answer) in narrative text and figures, but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | KGAREVION is available at https://github.com/mims-harvard/KGARevion. |
| Open Datasets | Yes | We first start with four multi-choice medical QA benchmarks (Xiong et al., 2024a) (Table 1). In addition, we introduce a new benchmark for multi-choice complex medical QA focused on differential diagnosis (DDx), named MedDDx... AfriMed-QA, a newly published QA dataset released after all baseline models in this study (Olatunji et al., 2024). |
| Dataset Splits | Yes | During the fine-tuning stage, we first split PrimeKG (Chandak et al., 2023) into two parts: a training set and a testing set, in a ratio of 8:2. |
| Hardware Specification | Yes | All experiments are conducted on a machine equipped with 4 NVIDIA H100. We use 1 NVIDIA H100 to implement baselines with small LLMs. In the fine-tuning stage, we use 4 NVIDIA H100 to fine-tune the review module. |
| Software Dependencies | Yes | We implement KGAREVION using Python 3.9.19, PyTorch 2.3.1, Transformers 4.43.1, and Tokenizers 0.19.1. |
| Experiment Setup | Yes | For hyperparameter tuning, we use grid search to identify the optimal parameter combinations by evaluating the fine-tuned model's performance on the knowledge graph completion task using the testing set. Specifically, we focus on the parameter r in LoRA training and the batch size during the fine-tuning stage. The values explored for r are 16, 32, 64, 128, while the tested batch sizes bz are 128, 256, 512, 1024. The best parameters identified are r = 32, bz = 256. |
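The grid search described in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' code: `evaluate` is a hypothetical callback standing in for fine-tuning the review module with a given LoRA rank `r` and batch size `bz` and scoring it on the KG-completion test split; the grids are the values reported in the paper.

```python
from itertools import product

# Hyperparameter grids reported in the paper: LoRA rank r and batch size bz.
R_VALUES = [16, 32, 64, 128]
BZ_VALUES = [128, 256, 512, 1024]

def grid_search(evaluate):
    """Return the (r, bz) pair that maximizes `evaluate`.

    `evaluate(r, bz)` is assumed to fine-tune the model with the given
    settings and return a validation score (higher is better); here it
    is just a placeholder for the real fine-tune-and-score loop.
    """
    best, best_score = None, float("-inf")
    for r, bz in product(R_VALUES, BZ_VALUES):
        score = evaluate(r, bz)
        if score > best_score:
            best, best_score = (r, bz), score
    return best

if __name__ == "__main__":
    # Toy scoring function peaking at the paper's reported optimum
    # (r = 32, bz = 256), purely to exercise the search loop.
    toy_score = lambda r, bz: -abs(r - 32) - abs(bz - 256) / 10
    print(grid_search(toy_score))  # -> (32, 256)
```

In practice, each `evaluate` call is expensive (a full fine-tuning run on 4 NVIDIA H100s per the Hardware Specification row), so the 4x4 grid implies 16 such runs.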