Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Lookahead Routing for Large Language Models

Authors: Canbin Huang, Tianyuan Shi, Yuhua Zhu, Ruijun Chen, Xiaojun Quan

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical evaluations across seven public benchmarks spanning instruction following, mathematical reasoning, and code generation show that Lookahead consistently outperforms existing routing baselines, achieving an average performance gain of 7.7% over the state-of-the-art.
Researcher Affiliation	Academia	(1) School of Computer Science and Engineering, Sun Yat-sen University, China; (2) Shenzhen Loop Area Institute, China
Pseudocode	No	The paper describes the methodology in prose and mathematical formulations, but there are no explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Our code is available at https://github.com/huangcb01/lookahead-routing.
Open Datasets	Yes	We construct a heterogeneous training corpus by aggregating prompts from three publicly available sources spanning diverse domains: (i) UltraFeedback [12]... (ii) Open Math Instruction-2 [36]... (iii) Self-Oss-Instruct-SC2 [25]... We evaluate routing performance on seven public benchmarks spanning three task types: (i) Instruction-following: AlpacaEval-2 [14], Arena-Hard [23], and MT-Bench [47]. (ii) Mathematics: GSM8K [11] and MATH [19]. (iii) Coding: HumanEval [8] and MBPP [3].
Dataset Splits	Yes	In Table 3, we provide the datasets from which queries are sampled to construct the training set along with the percentage of highest-scoring responses per candidate LLM. Table 3 lists sample counts for the train and validation splits of UltraFeedback, Open Math Instruction-2, Self-Oss-Instruct-SC2, and the overall dataset.
Hardware Specification	Yes	We conducted routing experiments with a batch size of 64 and a maximum length of 2048 tokens on a single 24GB NVIDIA RTX 3090 GPU.
Software Dependencies	No	The paper mentions using specific model backbones (SmolLM2-135M and ModernBERT-base) and the AdamW optimizer, but does not provide specific version numbers for software libraries or frameworks like Python, PyTorch, or the Hugging Face Transformers library.
Experiment Setup	Yes	For CLM-based Lookahead, we finally set λ to 0.5. For the MLM-based variant, we set λ to 0.2, m to 64, and α to 0.4... We conducted routing experiments with a batch size of 64 and a maximum length of 2048 tokens... The training was performed for 2 and 4 epochs for CLM- and MLM-based implementations, respectively. A cosine learning rate schedule and the AdamW optimizer are employed with a learning rate of 5e-5.
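
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch, which may help when attempting to reproduce the runs. The example below is illustrative only and is not taken from the authors' repository: the class name `RouterTrainConfig`, its field names, and the function `cosine_lr` are hypothetical. The row does not state whether warmup is used or what the final learning rate is, so the schedule assumes no warmup and decay to zero.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RouterTrainConfig:
    """Hypothetical container for the values quoted in the Experiment Setup row."""
    variant: str                   # "clm" or "mlm"
    epochs: int                    # 2 for CLM, 4 for MLM
    lam: float                     # lambda: 0.5 for CLM, 0.2 for MLM
    m: Optional[int] = None        # only reported for the MLM-based variant
    alpha: Optional[float] = None  # only reported for the MLM-based variant
    batch_size: int = 64
    max_length: int = 2048         # maximum sequence length in tokens
    peak_lr: float = 5e-5


CLM_CONFIG = RouterTrainConfig(variant="clm", epochs=2, lam=0.5)
MLM_CONFIG = RouterTrainConfig(variant="mlm", epochs=4, lam=0.2, m=64, alpha=0.4)


def cosine_lr(step: int, total_steps: int, peak_lr: float, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps.

    Assumption: no warmup phase and min_lr = 0, neither of which is
    specified in the quoted setup.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so out-of-range steps stay at the endpoints.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With a framework such as PyTorch, the same values would typically be passed to its AdamW optimizer and a cosine scheduler; the exact library versions are not reported, as noted in the Software Dependencies row.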