Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

KBQA-o1: Agentic Knowledge Base Question Answering with Monte Carlo Tree Search

Authors: Haoran Luo, Haihong E, Yikai Guo, Qika Lin, Xiaobao Wu, Xinyu Mu, Wenhao Liu, Meina Song, Yifan Zhu, Anh Tuan Luu

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimental results show that KBQA-o1 outperforms previous low-resource KBQA methods with limited annotated data, boosting Llama-3.1-8B model's GrailQA F1 performance to 78.5% compared to 48.5% of the previous sota method with GPT-3.5-turbo. Our code is publicly available at https://github.com/LHRLAB/KBQA-o1. ... We perform experiments on three KBQA datasets, GrailQA (Gu et al., 2021), WebQSP (Yih et al., 2016) and GraphQ (Su et al., 2016) in low-resource settings (Li et al., 2023) for application with limited annotated data. Experimental results demonstrate that KBQA-o1 outperforms existing low-resource KBQA methods and even approaches or surpasses the performance of fully supervised KBQA models, especially in more difficult cases like compositional and zero-shot. Ablation studies further validate the proposed MCTS-based agent process and incremental fine-tuning, both of which make KBQA-o1 outperform other forms of KBQA methods, as shown in Figure 2.
Researcher Affiliation	Academia	Haoran Luo (1, 2), Haihong E (1), Yikai Guo (3), Qika Lin (4), Xiaobao Wu (2), Xinyu Mu (1), Wenhao Liu (1), Meina Song (1), Yifan Zhu (1), Luu Anh Tuan (2). (1) Beijing University of Posts and Telecommunications; (2) Nanyang Technological University; (3) Beijing Institute of Computer Technology and Application; (4) National University of Singapore. Correspondence to: Haihong E <EMAIL>.
Pseudocode	Yes	Figure 8 and Algorithm 1 illustrate the Monte Carlo Tree Search (MCTS) process in KBQA-o1. The figure highlights the four stages of MCTS: Selection, where nodes are chosen using the Upper Confidence Bound for Trees (UCT) to balance exploration and exploitation; Expansion, where candidate actions are generated by the policy model, filtered for relevance to the knowledge base, and added as child nodes; Simulation, where the most promising path is explored to produce a complete logical form and compute rewards; and Back-propagation, where rewards are propagated back to update Q-values and visit counts. The pseudocode formalizes this process, iteratively performing rollouts that follow the four stages.
Open Source Code	Yes	Our code is publicly available at https://github.com/LHRLAB/KBQA-o1.
Open Datasets	Yes	We perform experiments on three KBQA datasets, GrailQA (Gu et al., 2021), WebQSP (Yih et al., 2016) and GraphQ (Su et al., 2016) in low-resource settings (Li et al., 2023) for application with limited annotated data.
Dataset Splits	Yes	Following KB-BINDER (Li et al., 2023), we conduct 40-shot experiments for GrailQA, and 100-shot for WebQSP and GraphQ. ... Table 6. Dataset statistics of KBQA-o1. #Train: GrailQA 40, WebQSP 100, GraphQ 100. #Exploration: GrailQA 43851, WebQSP 2929, GraphQ 2332. #Test: GrailQA 6696 (I.I.D. 1564, Compositional 1487, Zero-shot 3645), WebQSP 1566, GraphQ 2319.
Hardware Specification	Yes	All experiments are done on 8 NVIDIA A40 GPUs (48GB), with results averaged from three randomly seeded experiments.
Software Dependencies	No	The paper mentions several LLMs used (Llama-3, Qwen2.5, Gemma-2) and a model for semantic similarity (SimCSE), but does not provide specific version numbers for ancillary software dependencies such as programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA).
Experiment Setup	Yes	During the MCTS exploration phase, we set θ_exp with w = 50, while in the prediction phase, we set θ_eff with w = 10. We select multiple open-source 7B-72B LLMs, including Llama-3 (Dubey et al., 2024), Qwen2.5 (Yang et al., 2025) and Gemma-2 (Team et al., 2024), to construct KBQA-o1. ... Appendix G shows the optimal hyperparameter settings. ... Table 7 presents the hyperparameter configurations for the KBQA-o1 across three datasets: GrailQA, WebQSP, and GraphQ. These parameters are categorized into four stages: Initial Few-shot SFT, MCTS Exploration Stage, Incremental Fine-tuning, and MCTS Prediction Stage, each designed to optimize the KBQA framework's performance for different tasks.
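
Illustrative sketch (not taken from the paper or its repository): the Pseudocode row above describes four MCTS stages, and the Python below shows how such a loop is commonly structured. Everything in it is an assumption made for illustration: the `Node` class, the `mcts` signature, the textbook UCT formula with exploration weight `w`, and the toy `propose`, `is_terminal`, and `reward` functions, which stand in for the policy model, the knowledge-base filter, and the reward model. The simulation step here picks actions at random, whereas the paper is described as following the most promising path, and returning the highest-reward sequence seen is a choice of this sketch.

```python
import math
import random


class Node:
    """One search-tree node; `state` is the tuple of actions taken so far."""

    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.visits = 0   # visit count N
        self.q = 0.0      # running mean of back-propagated rewards (Q-value)


def uct(child, w):
    """Textbook UCT score: mean reward plus an exploration bonus weighted by w."""
    if child.visits == 0:
        return float("inf")  # make sure every new child is tried once
    return child.q + w * math.sqrt(math.log(child.parent.visits) / child.visits)


def mcts(root_state, propose, is_terminal, reward, rollouts, w, seed=0):
    """Run `rollouts` iterations and return the best complete state seen."""
    rng = random.Random(seed)
    root = Node(root_state)
    best_state, best_reward = None, float("-inf")
    for _ in range(rollouts):
        # 1. Selection: descend by UCT until reaching a node without children.
        node = root
        while node.children:
            node = max(node.children, key=lambda c: uct(c, w))
        # 2. Expansion: add one child per candidate action, then step into one.
        if not is_terminal(node.state):
            node.children = [Node(node.state + (a,), node) for a in propose(node.state)]
            if node.children:
                node = rng.choice(node.children)
        # 3. Simulation: complete the sequence (randomly, in this sketch) and score it.
        state = node.state
        while not is_terminal(state):
            actions = propose(state)
            if not actions:
                break
            state = state + (rng.choice(actions),)
        r = reward(state)
        if r > best_reward:
            best_state, best_reward = state, r
        # 4. Back-propagation: update visit counts and Q-values up to the root.
        while node is not None:
            node.visits += 1
            node.q += (r - node.q) / node.visits
            node = node.parent
    return best_state, best_reward


# Toy problem (hypothetical): build a 3-step sequence that matches TARGET.
TARGET = ("a", "b", "c")


def propose(state):
    return ["a", "b", "c"] if len(state) < len(TARGET) else []


def is_terminal(state):
    return len(state) == len(TARGET)


def reward(state):
    return sum(x == y for x, y in zip(state, TARGET)) / len(TARGET)


best_state, best_reward = mcts((), propose, is_terminal, reward, rollouts=1000, w=0.5)
print(best_state, best_reward)
```

A larger `w` spends more rollouts on rarely visited branches and a smaller `w` concentrates on branches that already score well, which is the trade-off behind using different `w` values for the exploration and prediction phases reported in the Experiment Setup row.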