Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via Reinforcement Learning

Authors: Haozhen Zhang, Tao Feng, Jiaxuan You

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments on seven general and multi-hop QA benchmarks show that Router-R1 outperforms several strong baselines, achieving superior performance while maintaining robust generalization and cost management.
Researcher Affiliation Academia Haozhen Zhang, Tao Feng, Jiaxuan You; University of Illinois at Urbana-Champaign
Pseudocode No The paper describes the methodology in prose and includes a 'Training prompt template' in Figure 2, but it does not contain explicit pseudocode or algorithm blocks for the overall Router-R1 framework or its components.
Open Source Code Yes ulab-uiuc/Router-R1 Hugging Face Collection
Open Datasets Yes We evaluate Router-R1 on seven question-answering (QA) datasets, i.e., (1) General QA: Natural Question (NQ) [16], TriviaQA [14], PopQA [19]; (2) Multi-Hop QA: HotpotQA (HpQA) [36], 2WikiMultiHopQA (2wiki) [9], Musique [32], and Bamboogle (Bamb) [24].
Dataset Splits Yes To incentivize both single-round and multi-round routing capabilities during training, we construct a joint dataset consisting of 7K samples each from the NQ and HotpotQA datasets, respectively. This results in a 14K sample training set, which we find sufficient to induce effective routing strategies without requiring extensive data filtering or complex sampling procedures. As demonstrated in our experimental analysis in Section 5, this modestly sized dataset enables robust routing and aggregation behavior learning. After training, we evaluate in-domain performance on NQ and HotpotQA datasets, where Router-R1 has seen similar data during training, and assess out-of-domain generalization performance across five other QA datasets mentioned above. For each evaluation dataset, we randomly sample 500 test instances (except for Bamboogle, which contains only around 120 test examples in total).
Hardware Specification Yes The base model training is conducted on NVIDIA A6000 GPUs, while routing LLMs are accessed via NVIDIA NIM APIs.
Software Dependencies No The paper mentions that 'The model is trained using veRL for reinforcement learning in LLMs, employing the Proximal Policy Optimization (PPO) as the default algorithm', but does not provide specific version numbers for software libraries or environments such as Python, PyTorch, or the veRL framework itself in the main text or appendices.
Experiment Setup Yes The model is trained using veRL for reinforcement learning in LLMs, employing the Proximal Policy Optimization (PPO) as the default algorithm. The batch size is set to 64, with a maximum of 225 training steps. The cost coefficient α is set to 0.0 in our main experiment unless otherwise specified. (From Appendix D, Table 8) Hyperparameters: Learning Rate (Actor) 1e-6, Learning Rate (Critic) 1e-5, Total Batch Size 64, Mini-batch Size 32, Micro-batch Size 8, Max Training Steps 225, Max Routing Steps 4, Max Sequence Length 4096, Max Response Length 1024, Max Length for LLM API Response 600, Tensor Parallel Size 1, GPU Utilization Ratio 0.6, Rollout Sampling Temperature (Train) 1.0, Rollout Sampling Temperature (Eval) 1.0.
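
The hyperparameters quoted from Appendix D, Table 8 can be collected into a small configuration object to check how they relate to one another. The following is a minimal sketch, not the authors' configuration file: the class name `RouterR1TrainConfig`, its field names, and the helper methods are hypothetical. The reading of mini-batch and micro-batch sizes (optimizer updates per training step, and gradient-accumulation passes per update) is an assumption about how PPO trainers commonly use these values; the report does not state it. The training-set size of 14K is taken from the Dataset Splits row and is approximate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouterR1TrainConfig:
    """Hypothetical container for the values reported in Appendix D, Table 8."""

    actor_lr: float = 1e-6
    critic_lr: float = 1e-5
    total_batch_size: int = 64
    mini_batch_size: int = 32
    micro_batch_size: int = 8
    max_training_steps: int = 225
    max_routing_steps: int = 4
    max_sequence_length: int = 4096
    max_response_length: int = 1024
    max_api_response_length: int = 600
    tensor_parallel_size: int = 1
    gpu_utilization_ratio: float = 0.6
    rollout_temperature_train: float = 1.0
    rollout_temperature_eval: float = 1.0
    cost_coefficient_alpha: float = 0.0  # main-experiment default

    def updates_per_step(self) -> int:
        # Assumption: each rollout batch is split into mini-batches,
        # with one optimizer update per mini-batch.
        return self.total_batch_size // self.mini_batch_size

    def accumulation_passes(self) -> int:
        # Assumption: each mini-batch is processed in micro-batches
        # whose gradients are accumulated before the update.
        return self.mini_batch_size // self.micro_batch_size

    def prompts_consumed(self) -> int:
        # Upper bound on training prompts seen if all steps are run.
        return self.total_batch_size * self.max_training_steps


# Approximate size of the joint training set (7K NQ + 7K HotpotQA).
TRAIN_SET_SIZE = 14_000

cfg = RouterR1TrainConfig()
passes_over_data = cfg.prompts_consumed() / TRAIN_SET_SIZE

print(f"updates per step:      {cfg.updates_per_step()}")
print(f"accumulation passes:   {cfg.accumulation_passes()}")
print(f"prompts consumed:      {cfg.prompts_consumed()}")
print(f"passes over train set: {passes_over_data:.2f}")
```

Under these assumptions, 225 steps at a batch size of 64 consume 14,400 prompts, which is roughly one pass over the 14K-sample training set.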