Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

RAGRouter: Learning to Route Queries to Multiple Retrieval-Augmented Language Models

Authors: Jiarui Zhang, Xiangyu Liu, Yong Hu, Chaoyue Niu, Fan Wu, Guihai Chen

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on diverse knowledge-intensive tasks and retrieval settings, covering open and closed-source LLMs, show that RAGRouter outperforms the best individual LLM and existing routing methods. We evaluate RAGRouter on a suite of knowledge-intensive tasks [32, 34, 25, 2, 23] and retrieval settings.
Researcher Affiliation	Collaboration	Jiarui Zhang1 Xiangyu Liu2 Yong Hu2 Chaoyue Niu1 Fan Wu1 Guihai Chen1 1Shanghai Jiao Tong University 2WeChat, Tencent Inc
Pseudocode	No	The paper describes the model architecture and optimization steps verbally and through mathematical equations and figures, but it does not include an explicit pseudocode block or algorithm section.
Open Source Code	Yes	The code and data are available at https://github.com/OwwO99/RAGRouter.
Open Datasets	Yes	We select queries from five different knowledge-intensive tasks: (i) PopQA [32] is an open-domain question-answering benchmark; (ii) MedMCQA [34] is a multiple-choice benchmark focused on biomedical knowledge; (iii) Natural Questions (NQ) [25] is an open-domain benchmark; (iv) WebQuestions (WebQ) [2] is a knowledge base-driven benchmark; and (v) TriviaQA (TQA) [23] is an open-domain benchmark. Local retrieval uses the 2018 English Wikipedia dump [24] with BGE-large-en-v1.5 [50] as the dense retriever. Online retrieval leverages the DuckDuckGo Web Search API to access up-to-date external content.
Dataset Splits	Yes	As shown in Table 8, we randomly sampled queries from five knowledge-intensive tasks (PopQA [32], MedMCQA [34], NQ [25], WebQ [2], and TriviaQA [23]) and partitioned them into training and test sets. For PopQA and MedMCQA: Train 2000, Test 270. For NQ, WebQ, and TriviaQA: Train 1000, Test 240.
Hardware Specification	Yes	All experiments are conducted on a single NVIDIA RTX 4090D GPU.
Software Dependencies	No	The paper mentions using specific models like "all-mpnet-base-v2" and "ms-marco-MiniLM-L12-v2" as encoders, "AdamW" as an optimizer, and the "vLLM framework" for local deployment. However, it does not provide specific version numbers for these software components or any programming languages/libraries.
Experiment Setup	Yes	For the RAGRouter architecture, we use all-mpnet-base-v2 [37] as the encoder for both queries and documents, and ms-marco-MiniLM-L12-v2 [46] as the cross-encoder, resulting in a total parameter size of approximately 136M. Both the knowledge representation vector and the RAG capability vector are set to a dimensionality of 768. To mitigate overfitting, all but the last two transformer layers in the query/document encoder and the cross-encoder are frozen during training. The classification loss weight λ is set to 2.0, and the contrastive learning temperature τ to 0.2. The router is optimized using AdamW [30] with a learning rate of 5e-5, batch size of 64, for 10 epochs.
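
The reported setup and splits can be collected into a small configuration sketch for anyone attempting a rerun. This is not the authors' code: the names `RouterConfig`, `SPLITS`, `frozen_layer_indices`, and `split_totals` are hypothetical, the split counts are assumed to apply to each listed task separately, and the 12-layer figure used in the usage lines is an assumption about the two encoders rather than something stated in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouterConfig:
    """Hyperparameters as quoted in the Experiment Setup row (names are hypothetical)."""
    encoder: str = "all-mpnet-base-v2"
    cross_encoder: str = "ms-marco-MiniLM-L12-v2"
    vector_dim: int = 768              # knowledge and RAG capability vectors
    trainable_top_layers: int = 2      # all other transformer layers are frozen
    loss_weight_lambda: float = 2.0    # classification loss weight
    temperature_tau: float = 0.2       # contrastive learning temperature
    optimizer: str = "AdamW"
    learning_rate: float = 5e-5
    batch_size: int = 64
    epochs: int = 10


# (train, test) query counts, assuming the reported numbers are per task.
SPLITS = {
    "PopQA": (2000, 270),
    "MedMCQA": (2000, 270),
    "NQ": (1000, 240),
    "WebQ": (1000, 240),
    "TriviaQA": (1000, 240),
}


def frozen_layer_indices(num_layers: int, trainable_top_layers: int = 2) -> list[int]:
    """Return indices of layers to freeze, leaving the top layers trainable."""
    if not 0 <= trainable_top_layers <= num_layers:
        raise ValueError("trainable_top_layers must be between 0 and num_layers")
    return list(range(num_layers - trainable_top_layers))


def split_totals(splits: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum train and test counts over all tasks."""
    train = sum(n_train for n_train, _ in splits.values())
    test = sum(n_test for _, n_test in splits.values())
    return train, test


config = RouterConfig()
train_total, test_total = split_totals(SPLITS)
# Assumes a 12-layer encoder; adjust num_layers to the model actually loaded.
to_freeze = frozen_layer_indices(num_layers=12,
                                 trainable_top_layers=config.trainable_top_layers)
```

Under the per-task reading of the splits, the totals come to 7000 training and 1260 test queries. The index list from `frozen_layer_indices` could then be used to disable gradients on the matching layers in whichever framework loads the encoders.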