Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

ReMindRAG: Low-Cost LLM-Guided Knowledge Graph Traversal for Efficient RAG

Authors: Yikuan Hu, Jifeng Zhu, Lanrui Tang, Chen Huang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	To evaluate REMINDRAG, we conduct extensive experiments across various benchmark datasets and LLM backbones. The experimental results demonstrate that REMINDRAG exhibits a clear advantage over competing baseline approaches, achieving performance gains of 5% to 10% while simultaneously reducing the average cost per query by approximately 50%.
Researcher Affiliation	Academia	(1) College of Computer Science, Sichuan University, China; (2) School of Computing and Data Science, The University of Hong Kong, Hong Kong, China; (3) Institute of Data Science, National University of Singapore, Singapore
Pseudocode	Yes	Algorithm 1 LLM-Guided KG Traversal
Open Source Code	Yes	Our code is available at https://github.com/kilgrims/ReMindRAG.
Open Datasets	Yes	1) Long Dependency QA: We resort to the LooGLE dataset [24]. 2) Multi-Hop QA: We resort to the HotpotQA dataset [49]. 3) Simple QA: This task leans towards traditional retrieval tasks, emphasizing the model's ability to extract directly associated information from local contexts. We adopt the Short Dependency QA from the LooGLE [24] dataset as a representative example.
Dataset Splits	Yes	Here, given a dataset A containing documents and user queries, we consider three setups: (1) Same Query: The model initially has already been evaluated on dataset A and has memorized information from A and is subsequently re-evaluated again on the same dataset. (2) Similar Query: The model is re-evaluated again on a dataset A′ whose queries are semantically equivalent paraphrases of those in A (cf. Appendix C.3 for implementation). (3) Different Query: The model is re-evaluated again on a dataset A′ whose queries are distinct from those in A but share similar questions (cf. Appendix C.4 for implementation).
Hardware Specification	Yes	Experiments were executed on a dedicated research workstation configured with an AMD Ryzen 7 7800X3D 8-Core Processor, an NVIDIA GeForce RTX 4070 Ti SUPER GPU, and 64 GB of DDR5 RAM.
Software Dependencies	No	All operations utilizing large language models (GPT-4o, GPT-4o-mini, Deepseek-V3) are completed by invoking APIs. Notably, the proposed experiments can be readily replicated on ordinary machines, ensuring broad accessibility and reproducibility for the research community. Additionally, all dense embedding computations employed the "nomic-ai/nomic-embed-text-v2-moe" model [33] as the foundational embedding model. For tokenization operations, we uniformly adopted the "nomic-ai/nomic-embed-text-v2-moe" as tokenizer.
Experiment Setup	Yes	The random seed for the large language model was fixed to 123, and the temperature parameter was set to 0 to eliminate randomness in the generation process. Additionally, GPT-4o is employed as our LLM-based evaluator. Furthermore, all dense embedding computations employed the "nomic-ai/nomic-embed-text-v2-moe" model [33] as the foundational embedding model. For tokenization operations, we uniformly adopted the "nomic-ai/nomic-embed-text-v2-moe" as tokenizer, with all token-based chunks standardized to 750 tokens in length. For detailed parameter settings of other experiments, please refer to Appendix D.4. ... Node Correlation Weight (α): ... In this study, we adopt α = 0.1. Strong Connection Threshold (λ): ... In the experiments of this paper, we select this value as 0.55. ... Synonym Similarity Threshold: Our experiments employ 0.7 as the default value. Maximum Hop Count: ... We set this parameter to 10 ... Question Decomposition Limit: ... we set this to 1 ... Initial Seed Node Count: ... We configure this as 2 seed nodes.
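
For readers attempting a replication, the settings quoted in the Experiment Setup row can be gathered into a single configuration object. The sketch below is illustrative only: the class name `ReportedSetup`, its field names, and the helper `chunk_tokens` are hypothetical and are not taken from the authors' repository. The excerpt does not say whether chunks overlap or how a short final chunk is handled, so the sketch assumes non-overlapping chunks with a shorter last chunk, and it uses whitespace splitting as a stand-in for the actual tokenizer.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    llm_seed: int = 123
    llm_temperature: float = 0.0
    chunk_size_tokens: int = 750
    node_correlation_weight: float = 0.1        # alpha in the paper
    strong_connection_threshold: float = 0.55   # lambda in the paper
    synonym_similarity_threshold: float = 0.7
    max_hop_count: int = 10
    question_decomposition_limit: int = 1
    initial_seed_nodes: int = 2
    embedding_model: str = "nomic-ai/nomic-embed-text-v2-moe"


def chunk_tokens(tokens: Sequence[str], chunk_size: int) -> List[List[str]]:
    """Split tokens into consecutive, non-overlapping chunks (assumed scheme).

    Every chunk has chunk_size tokens except possibly the last one.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [list(tokens[i:i + chunk_size])
            for i in range(0, len(tokens), chunk_size)]


setup = ReportedSetup()

# Whitespace splitting stands in for the real tokenizer (an assumption).
toy_tokens = ("word " * 1600).split()
chunks = chunk_tokens(toy_tokens, setup.chunk_size_tokens)
print([len(c) for c in chunks])  # [750, 750, 100]
```

Keeping the values in a frozen dataclass makes accidental changes during a replication run fail loudly. The remaining settings are elided in the excerpt and are covered by Appendix D.4 of the paper.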