Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

GFM-RAG: Graph Foundation Model for Retrieval Augmented Generation

Authors: Linhao Luo, Zicheng Zhao, Reza Haffari, Dinh Phung, Chen Gong, Shirui Pan

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on three multi-hop QA datasets and seven domain-specific RAG datasets demonstrate that GFM-RAG achieves state-of-the-art performance while maintaining efficiency and alignment with neural scaling laws, highlighting its potential for further improvement.
Researcher Affiliation	Academia	Monash University, Nanjing University of Science and Technology, Shanghai Jiao Tong University, Griffith University
Pseudocode	No	The paper describes the methodology using mathematical equations and textual descriptions of the steps (e.g., section 3.2.1 Query-dependent GNN, 3.2.2 Training Process), but it does not include a dedicated pseudocode block or algorithm figure.
Open Source Code	Yes	Project page: https://rmanluo.github.io/gfm-rag
Open Datasets	Yes	We first evaluate the effectiveness of GFM-RAG on three widely-used multi-hop QA datasets, including HotpotQA [72], MuSiQue [63], and 2WikiMultiHopQA (2Wiki) [20]. We also evaluate the performance of GFM-RAG on seven RAG datasets from three domains, including biomedical [25], custom support [54, 44, 39, 4], and general knowledge [45, 27], to demonstrate the generalizability of GFM-RAG as the foundation model.
Dataset Splits	Yes	In experiments, we adhere to the official data split to obtain the training samples and follow existing methods [64, 16] to use the same 1,000 samples from each validation set to avoid data leakage. We merge the candidate passages as the document corpus for KG-index construction. The statistics of the training and test data are presented in Table 5 and Table 6, respectively.
Hardware Specification	Yes	The total parameters of the GFM retriever are 8M, which is trained on 8 NVIDIA A100s (80G) with batch size 4, learning rate 5e-4, and loss weight α = 0.3.
Software Dependencies	No	The paper mentions specific models like "all-mpnet-v2 [57]", "GPT-4o-mini [47]", and "ColBERTv2 [55]" but does not provide specific version numbers for general software dependencies or libraries (e.g., Python, PyTorch, CUDA versions).
Experiment Setup	Yes	The GFM retriever is implemented with 6 query-dependent message passing layers with the hidden dimension set to 512. The pre-trained all-mpnet-v2 [57] is adopted as the sentence embedding model and is frozen during training. The total parameters of the GFM retriever are 8M, which is trained on 8 NVIDIA A100s (80G) with batch size 4, learning rate 5e-4, and loss weight α = 0.3. The training data contains 60 KGs with over 14M triples constructed from 700k documents extracted from the training set. The statistics of training data are shown in Table 5, and the implementations are detailed in Appendix D.
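
For readers attempting a reproduction, the values quoted in the Experiment Setup row can be gathered into a single configuration object. The following is a minimal sketch only: the class name and field names are hypothetical and are not taken from the GFM-RAG codebase, and the excerpt does not say whether the batch size of 4 is per GPU or global, so no effective batch size is derived.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class GFMRetrieverSetup:
    """Hyperparameters as quoted in the table above.

    The class and all field names are hypothetical labels chosen for
    illustration; only the values come from the quoted paper text.
    """

    num_message_passing_layers: int = 6   # query-dependent message passing layers
    hidden_dim: int = 512
    sentence_encoder: str = "all-mpnet-v2"  # name as written in the excerpt
    freeze_sentence_encoder: bool = True    # encoder is frozen during training
    num_gpus: int = 8                       # NVIDIA A100 (80G)
    batch_size: int = 4                     # per-GPU vs. global is not stated
    learning_rate: float = 5e-4
    loss_weight_alpha: float = 0.3


# Plain dict form, convenient for logging alongside experiment results.
config = asdict(GFMRetrieverSetup())

for name, value in config.items():
    print(f"{name}: {value}")
```

Because the Software Dependencies row is marked "No", library versions (Python, PyTorch, CUDA) are deliberately left out of this sketch; they would have to be taken from the project page or the released code.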