Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Lookahead Routing for Large Language Models
Authors: Canbin Huang, Tianyuan Shi, Yuhua Zhu, Ruijun Chen, Xiaojun Quan
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations across seven public benchmarks spanning instruction following, mathematical reasoning, and code generation show that Lookahead consistently outperforms existing routing baselines, achieving an average performance gain of 7.7% over the state-of-the-art. |
| Researcher Affiliation | Academia | ¹School of Computer Science and Engineering, Sun Yat-sen University, China; ²Shenzhen Loop Area Institute, China |
| Pseudocode | No | The paper describes the methodology in prose and mathematical formulations, but there are no explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/huangcb01/lookahead-routing. |
| Open Datasets | Yes | We construct a heterogeneous training corpus by aggregating prompts from three publicly available sources spanning diverse domains: (i) UltraFeedback [12]... (ii) OpenMathInstruct-2 [36]... (iii) Self-Oss-Instruct-SC2 [25]... We evaluate routing performance on seven public benchmarks spanning three task types: (i) Instruction-following: AlpacaEval-2 [14], Arena-Hard [23], and MT-Bench [47]. (ii) Mathematics: GSM8K [11] and MATH [19]. (iii) Coding: HumanEval [8] and MBPP [3]. |
| Dataset Splits | Yes | In Table 3, we provide the datasets from which queries are sampled to construct the training set along with the percentage of highest-scoring responses per candidate LLM. Table 3 lists sample counts for the train and validation splits of UltraFeedback, OpenMathInstruct-2, Self-Oss-Instruct-SC2, and the overall dataset. |
| Hardware Specification | Yes | We conducted routing experiments with a batch size of 64 and a maximum length of 2048 tokens on a single 24GB NVIDIA RTX 3090 GPU. |
| Software Dependencies | No | The paper mentions using specific model backbones (SmolLM2-135M and ModernBERT-base) and the AdamW optimizer, but does not provide specific version numbers for software libraries or frameworks like Python, PyTorch, or the Hugging Face Transformers library. |
| Experiment Setup | Yes | For CLM-based Lookahead, we finally set λ to 0.5. For the MLM-based variant, we set λ to 0.2, m to 64, and α to 0.4... We conducted routing experiments with a batch size of 64 and a maximum length of 2048 tokens... The training was performed for 2 and 4 epochs for CLM- and MLM-based implementations, respectively. A cosine learning rate schedule and the AdamW optimizer are employed with a learning rate of 5e-5. |
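
The Experiment Setup row can be summarized as a small configuration sketch. It is an illustration only and is not taken from the authors' repository: the names `RoutingTrainConfig` and `cosine_lr` are hypothetical, and the absence of warmup, the decay to a minimum learning rate of zero, and the treatment of 5e-5 as the peak rate are assumptions, since the quoted text does not specify them.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RoutingTrainConfig:
    """Hyperparameters quoted in the Experiment Setup row (class name is hypothetical)."""
    variant: str                    # "clm" or "mlm"
    epochs: int                     # 2 for CLM, 4 for MLM
    lam: float                      # lambda: 0.5 for CLM, 0.2 for MLM
    m: Optional[int] = None         # 64, reported for the MLM-based variant only
    alpha: Optional[float] = None   # 0.4, reported for the MLM-based variant only
    batch_size: int = 64
    max_length: int = 2048          # maximum sequence length in tokens
    peak_lr: float = 5e-5           # used with AdamW and a cosine schedule


CLM_CONFIG = RoutingTrainConfig(variant="clm", epochs=2, lam=0.5)
MLM_CONFIG = RoutingTrainConfig(variant="mlm", epochs=4, lam=0.2, m=64, alpha=0.4)


def cosine_lr(step: int, total_steps: int, peak_lr: float = 5e-5, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr to min_lr over total_steps.

    Assumption: no warmup and min_lr = 0; the source does not state either.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp the step so that values outside [0, total_steps] stay on the curve's ends.
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

For example, `cosine_lr(0, 1000)` returns the peak rate and the value falls to half the peak at step 500. The actual number of optimizer steps depends on the training-set size in Table 3 of the paper, which is not reproduced here.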