Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

NeuroPath: Neurobiology-Inspired Path Tracking and Reflection for Semantically Coherent Retrieval

Authors: Junchen Li, Rongzheng Wang, Yihong Huang, Qizhi Chen, Jiasheng Zhang, Shuang Liang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conducted extensive experiments to evaluate NeuroPath and answer the following research questions (RQ): RQ1: How effective is NeuroPath? RQ2: How does NeuroPath demonstrate its advantages in semantic coherence and noise reduction? RQ3: Do all parts of our framework work? RQ4: What is its scalability to other small open-source models and to tasks of varying complexity?
Researcher Affiliation	Academia	Junchen Li1, Rongzheng Wang1, Yihong Huang1, Qizhi Chen1, Jiasheng Zhang1, Shuang Liang1 1Institute of Intelligent Computing, University of Electronic Science and Technology of China, Chengdu, China EMAIL, EMAIL
Pseudocode	Yes	Algorithm 1 outlines the NeuroPath workflow, which comprises three main stages: Static Indexing, Dynamic Path Tracking, and Post-retrieval Completion. Moreover, Figure 10 and Figure 11 provide illustrative examples of the Dynamic Path Tracking and Post-retrieval Completion processes, respectively.
Open Source Code	Yes	Code is available at https://github.com/KennyCaty/NeuroPath.
Open Datasets	Yes	We selected three challenging multi-hop question answering datasets to evaluate our method: MuSiQue [38], 2WikiMultiHopQA [16] and HotpotQA [43]. All three datasets are used to evaluate open-domain multi-hop reasoning tasks, ranging from 2-hop to longer-hop reasoning scenarios.
Dataset Splits	Yes	We followed the settings of HippoRAG [14], selected 1,000 questions from each dataset for evaluation, used all the documents from each selected dataset as the retrieval corpus. For each question, only a small number of supporting documents are involved to verify the retrieval performance.
Hardware Specification	Yes	All open-source LLMs in our experiments are deployed using vLLM [20] on NVIDIA GeForce RTX 4090.
Software Dependencies	No	All open-source LLMs in our experiments are deployed using vLLM [20] on NVIDIA GeForce RTX 4090. While vLLM is mentioned as a library, its specific version number is not provided, nor are specific versions for other libraries or frameworks.
Experiment Setup	Yes	For NeuroPath, we set the maximum number of reasoning hops to 2 and use a Zero-Shot prompting setup. For IRCoT and Iter-RetGen, we set the maximum number of iterations to 3 and follow the original paper settings, with the number of documents retrieved per iteration set to 2, 4, 6 and 5, respectively. We additionally select 1,500 questions from the original 2WikiMultiHopQA dataset and follow the same procedure as in the main experiments using DeepSeek-V3 [22]. The outputs are then used to fine-tune Llama-3.1-8B-Instruct [12]. Specifically, we fine-tune Llama-3.1-8B-Instruct using QLoRA with 8-bit quantization and FlashAttention-2 for efficiency. LoRA is applied to all layers with a rank of 8, α = 16, and dropout = 0. We train for 3 epochs using the AdamW optimizer and a cosine learning rate schedule with a base learning rate of 5e-5.
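
The fine-tuning hyperparameters quoted in the Experiment Setup row can be gathered into a small configuration sketch. This is an illustration only, not the authors' training script: the names `FineTuneConfig`, `lora_scaling`, and `cosine_lr` are hypothetical, and the schedule assumes no warmup and decay to zero, neither of which is stated in the quoted text. The total step count in the usage lines is likewise a made-up value.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters as quoted in the Experiment Setup row (hypothetical container)."""
    base_model: str = "Llama-3.1-8B-Instruct"
    quantization_bits: int = 8      # QLoRA with 8-bit quantization
    lora_rank: int = 8              # LoRA rank r
    lora_alpha: int = 16            # LoRA alpha
    lora_dropout: float = 0.0
    epochs: int = 3
    optimizer: str = "AdamW"
    base_lr: float = 5e-5           # base rate for the cosine schedule


def lora_scaling(cfg: FineTuneConfig) -> float:
    """Standard LoRA scaling factor alpha / r applied to the low-rank update."""
    return cfg.lora_alpha / cfg.lora_rank


def cosine_lr(step: int, total_steps: int, base_lr: float) -> float:
    """Cosine decay from base_lr to 0.

    Assumption: no warmup and a final rate of 0; the source does not specify either.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp the step so out-of-range values stay on the schedule's endpoints.
    progress = min(max(step, 0), total_steps) / total_steps
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


cfg = FineTuneConfig()
total_steps = 300  # hypothetical; depends on dataset size and batch size
halfway_lr = cosine_lr(total_steps // 2, total_steps, cfg.base_lr)
```

With rank 8 and α = 16, the low-rank update is scaled by a factor of 2, and under the stated assumptions the learning rate at the midpoint of training is half the base rate.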