Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Can Knowledge-Graph-based Retrieval Augmented Generation Really Retrieve What You Need?

Authors: Junchi Yu, Yujie Liu, Jindong Gu, Philip H.S. Torr, Dongzhan Zhou

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate GraphFlow on STaRK benchmark, which includes real-world queries from multiple domains over text-rich KGs. GraphFlow outperforms strong KG-based RAG baselines including GPT-4o by 10% performance gain on both retrieval accuracy and diversity metrics.
Researcher Affiliation	Collaboration	(1) Department of Engineering Science, University of Oxford, UK; (2) Shanghai Artificial Intelligence Laboratory, China
Pseudocode	No	The paper describes methods and processes in text, such as in Section 3 "Method" and Appendix B "Implementation of Graph Flow with LLMs", including prompt templates, but does not present any formal pseudocode or algorithm blocks with numbered steps.
Open Source Code	Yes	Code is available at https://github.com/Samyu0304/GraphFlow
Open Datasets	Yes	We employ the STaRK [74] benchmark to validate the retrieval quality of the proposed GraphFlow to support complex queries. STaRK is a recently proposed benchmark designed to evaluate the retrieval performance of KG-based RAG methods on text-rich KGs spanning three domains:
Dataset Splits	Yes	For every training step, we construct a mini-batch of transitions between states to calculate the loss in Eq. 9. The training dynamic is shown in Figure 5. Here, training transition loss is calculated using the transition between non-terminal states. And training starting loss and training end loss are calculated using boundary condition F(s_0) = F(s_T) = 0. Training total loss and eval loss are calculated on all the transitions between states on the training and evaluation dataset. Table 4: Parameters of GraphFlow training on STaRK benchmark. ... eval_ratio 0.8
Hardware Specification	Yes	We run all experiments on 8/16 NVIDIA-A800-SXM4-80GB GPUs and 56 Intel(R) Xeon(R) Platinum 8336C CPUs.
Software Dependencies	No	We use LLaMA3-8B-Instruct as the backbone LLM to implement GraphFlow. Specifically, we first employ the following flow prompt template to wrap the retrieval trajectory τ_t at state s_t into a text sequence for flow estimation. ... We employ the TRL (Transformer Reinforcement Learning) package for SFT and PRM fine-tuning.
Experiment Setup	Yes	We train GraphFlow on these datasets for one epoch; other important parameters are shown in Table 4. Table 4: Parameters of GraphFlow training on STaRK benchmark. Accumulation steps: 2; alpha: 16; batch_size: 1; num_gpu: 8; depth_cutoff: 6; doc_cutoff: 400; eval_ratio: 0.8; eval_step: 100; lora_dropout: 0.05; lr: 1.00E-05; max_length: 1024; n_epochs: 1; num_exploration: 4; r: 32; window_size: 3
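
For readers who want to carry the Table 4 values into a script, the following is a minimal sketch, not the authors' configuration code. The class name, field names, and both helper methods are hypothetical. The derived quantities rest on two assumptions that the quoted text does not confirm: that training is standard data-parallel with one process per GPU, and that `alpha` and `r` are the usual LoRA scaling factor and rank (suggested, but not stated, by the presence of `lora_dropout`).

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class GraphFlowTrainConfig:
    """Values copied from Table 4 as quoted above; names are illustrative only."""
    accumulation_steps: int = 2
    alpha: int = 16
    batch_size: int = 1
    num_gpu: int = 8
    depth_cutoff: int = 6
    doc_cutoff: int = 400
    eval_ratio: float = 0.8
    eval_step: int = 100
    lora_dropout: float = 0.05
    lr: float = 1e-5
    max_length: int = 1024
    n_epochs: int = 1
    num_exploration: int = 4
    r: int = 32
    window_size: int = 3

    def effective_batch_size(self) -> int:
        # Assumption: data-parallel training, so the per-device batch is
        # multiplied by gradient accumulation steps and the number of GPUs.
        return self.batch_size * self.accumulation_steps * self.num_gpu

    def lora_scale(self) -> float:
        # Assumption: alpha and r are LoRA hyperparameters, with the
        # conventional update scaling of alpha / r.
        return self.alpha / self.r


config = GraphFlowTrainConfig()
print(asdict(config))
print("effective batch size:", config.effective_batch_size())
print("LoRA scale:", config.lora_scale())
```

Under those assumptions, the settings correspond to 16 sequences per optimizer step and a LoRA scale of 0.5. The meaning of `eval_ratio` is not defined in the quoted text, so the sketch stores it without interpreting it.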