Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing

Authors: Tianyu Fu, Yi Ge, Yichen You, Enshu Liu, Zhihang Yuan, Guohao Dai, Shengen Yan, Huazhong Yang, Yu Wang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We apply R2R to combine R1-1.5B and R1-32B models from the DeepSeek family, and evaluate on challenging math, coding, and QA benchmarks. With an average activated parameter size of 5.6B, R2R surpasses the average accuracy of R1-7B by 1.6×, outperforming even the R1-14B model.
Researcher Affiliation	Collaboration	Tianyu Fu (1,2), Yi Ge (1), Yichen You (1), Enshu Liu (1), Zhihang Yuan (2), Guohao Dai (3,2), Shengen Yan (2), Huazhong Yang (1), Yu Wang (1). Affiliations: (1) Tsinghua University; (2) Infinigence AI; (3) Shanghai Jiao Tong University
Pseudocode	Yes	Algorithm 1: Path-Following Routing. Input: partial sequence S_{<i}, models {θ_s, θ_l}. Output: selected model m_i. 1: y_s ← arg max_y P_{θ_s}(y | S_{<i}); 2: y_l ← arg max_y P_{θ_l}(y | S_{<i}); 3: if y_s = y_l then; 4: m_i ← θ_s ▷ identical; 5: else; 6: S_s ← CONTINUATION(S_{<i}, y_s); 7: S_l ← CONTINUATION(S_{<i}, y_l); 8: if Je(S_s, S_l) = 0 then; 9: m_i ← θ_s ▷ neutral; 10: else; 11: m_i ← θ_l ▷ divergent; 12: end if; 13: end if; 14: return m_i
Open Source Code	Yes	Our code is available at https://github.com/thu-nics/R2R.
Open Datasets	Yes	We evaluate methods across challenging reasoning benchmarks, including mathematics (AIME 2024–2025 [10]; denoted as AIME), graduate-level question-answering (GPQA-Diamond [36]; denoted as GPQA), and coding tasks (LiveCodeBench 2024-08 to 2025-01; denoted as LiveCodeBench [37]). All experiments use a maximum output length of 32K tokens and zero generation temperature to ensure reproducibility. ... Our training data for the router are sourced from tasks across three distinct scenarios: mathematics, code, and question answering (QA). The mathematics problems are drawn from the American Invitational Mathematics Examination (AIME) [10], covering the years 1983 to 2022. Code and QA problems are sampled from the Bespoke-Stratos-17k dataset [31].
Dataset Splits	Yes	Our training data for the router are sourced from tasks across three distinct scenarios: mathematics, code, and question answering (QA). The mathematics problems are drawn from the American Invitational Mathematics Examination (AIME) [10], covering the years 1983 to 2022. Code and QA problems are sampled from the Bespoke-Stratos-17k dataset [31]. ... Our validation dataset is constructed in exactly the same way as the training data, but with different queries. The validation dataset comprises all 30 problems from AIME 2023, 69 coding problems from the Bespoke-Stratos-17k dataset that are excluded from the training set, and 60 QA problems selected from the GPQA-Extended [36] dataset.
Hardware Specification	Yes	We efficiently generate 7.6 million routing labels in approximately 2.3 days on 8 A800 GPUs, covering topics of math, coding, and QA with queries from the Bespoke-Stratos [31] dataset. ... All baselines use the official, highly efficient SGLang [29] framework and are evaluated with tensor parallelism on two NVIDIA A800-80GB GPUs.
Software Dependencies	No	All baselines use the official, highly efficient SGLang [29] framework and are evaluated with tensor parallelism on two NVIDIA A800-80GB GPUs. ... The SLM is loaded onto a single GPU (GPU 1) using the SGLang scheduler, with the mem_fraction_static set to 0.15. The LLM employs tensor-parallel inference distributed across two GPUs (GPU 0 and GPU 1) via SGLang schedulers managed by PyTorch's distributed multiprocessing framework with the NCCL backend, with the mem_fraction_static set to 0.80.
Experiment Setup	Yes	During training, we employ the AdamW optimizer with hyperparameters β1 = 0.9 and β2 = 0.999. The learning rate is set to 5 × 10^-5, with a dropout rate of 0.1 and a weight decay of 5 × 10^-4. We train the neural network with float32 precision. The router is trained for up to 50 epochs using a batch size of 1024, with early stopping applied based on a patience of 10 epochs. Validation is performed at every epoch. We adopt the checkpoint corresponding to the best-performing epoch on the validation set as the final router used. ... All experiments use a maximum output length of 32K tokens and zero generation temperature to ensure reproducibility.
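
Illustrative sketch: The control flow of the Path-Following Routing pseudocode quoted in the Pseudocode row can be written out in Python as follows. This is not the authors' implementation (see the linked repository for that). The function name route_token, the callables small_next, large_next, continuation and verifier, and the toy stand-ins below are all hypothetical names introduced here for illustration; the only assumption carried over from the quoted pseudocode is that a verifier value of 0 means the two continuations are treated as equivalent.

```python
from typing import Callable, List

Tokens = List[str]


def route_token(
    prefix: Tokens,
    small_next: Callable[[Tokens], str],          # hypothetical: greedy next token of the small model
    large_next: Callable[[Tokens], str],          # hypothetical: greedy next token of the large model
    continuation: Callable[[Tokens, str], Tokens],  # hypothetical: extends prefix + token
    verifier: Callable[[Tokens, Tokens], int],    # hypothetical: 0 means "no meaningful difference"
) -> str:
    """Return "small" or "large" for the next position, mirroring the quoted pseudocode."""
    y_s = small_next(prefix)
    y_l = large_next(prefix)
    if y_s == y_l:
        return "small"  # identical: both models pick the same token
    s_s = continuation(prefix, y_s)
    s_l = continuation(prefix, y_l)
    if verifier(s_s, s_l) == 0:
        return "small"  # neutral: tokens differ, continuations judged equivalent
    return "large"      # divergent: the difference changes the continuation


# Toy stand-ins (assumptions for demonstration only, not real models).
def toy_continuation(prefix: Tokens, token: str) -> Tokens:
    # A real continuation would keep generating; here we only append the token.
    return prefix + [token]


def toy_verifier(seq_s: Tokens, seq_l: Tokens) -> int:
    # Treat sequences as equivalent when they match case-insensitively.
    same = [t.lower() for t in seq_s] == [t.lower() for t in seq_l]
    return 0 if same else 1
```

In this sketch the large model is selected only in the divergent case, which is the property the quoted algorithm uses to label routing decisions; how continuations are generated and compared in practice is left to the paper and its code.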