Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Scaling Speculative Decoding with Lookahead Reasoning
Authors: Yichao Fu, Rui Ge, Zelei Shao, Zhijie Deng, Hao Helen Zhang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments showing consistent performance improvements across diverse datasets. (From Section 1, Contributions); Section 4: Experiment |
| Researcher Affiliation | Academia | ¹UCSD, ²Shanghai Jiao Tong University |
| Pseudocode | Yes | Algorithm 1 Lookahead Reasoning (Sync Version) |
| Open Source Code | Yes | Our code is available at https://github.com/hao-ai-lab/LookaheadReasoning |
| Open Datasets | Yes | Our evaluation spans a suite of benchmarks... For code generation, we use HumanEval [14] and LiveCodeBench [15]. Math reasoning tasks are assessed using GSM8K [16], AIME 24 [17], and AMC12 23 [18]. For question answering, we include GPQA [19] and MT-Bench [7]. |
| Dataset Splits | Yes | Specific to dataset sampling, we utilize 40 out of 50 problems from AMC12 23, selected by Qwen2.5 Math [20], and randomly sample 100 queries from the 1.3K GSM8K test set. For LiveCodeBench, we select 268 problems collected between August 2024 and January 2025, following previous research [4]. |
| Hardware Specification | Yes | Experiments are conducted on a server equipped with eight NVIDIA H100 GPUs. Target models (32B) are deployed across two H100 GPUs using tensor parallelism. Draft models (1.5B/1.7B) and the default judge model (Qwen2.5-7B-Instruct) are each deployed on a single H100 GPU. |
| Software Dependencies | Yes | Our algorithm is built upon vLLM v0.8.3. |
| Experiment Setup | Yes | For the DeepSeek-R1-Distill series, we adhere to the official settings with a temperature of 0.6, top_p of 0.95, and a maximum generation length of 32K. For the Qwen3 series, the temperature is set to 0.6, top_p to 0.95, min_p to 0, top_k to 20, and the maximum generation length is 37K. ... The number of speculative tokens is set to 8 for SD and the number of speculative steps is set to 6 for LOOKAHEAD REASONING by default. |
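
For readers who want to reproduce the settings quoted in the Experiment Setup row, the values can be collected into a small configuration object. The following is a minimal sketch, not the authors' code: the class `DecodingConfig`, the field name `max_tokens`, and the constant names are hypothetical, and reading "32K" and "37K" as multiples of 1024 tokens is an assumption that the quoted text does not confirm.

```python
from dataclasses import dataclass, asdict
from typing import Optional

# Assumption: "K" in the quoted generation lengths means 1024 tokens.
K = 1024


@dataclass(frozen=True)
class DecodingConfig:
    """Hypothetical container for the sampling settings quoted in the table."""
    temperature: float
    top_p: float
    max_tokens: int                # hypothetical name for "maximum generation length"
    min_p: Optional[float] = None  # None means the quoted setup does not specify it
    top_k: Optional[int] = None

    def to_kwargs(self) -> dict:
        """Return only the settings that were explicitly specified."""
        return {k: v for k, v in asdict(self).items() if v is not None}


# DeepSeek-R1-Distill series: temperature 0.6, top_p 0.95, length 32K.
DEEPSEEK_R1_DISTILL = DecodingConfig(temperature=0.6, top_p=0.95, max_tokens=32 * K)

# Qwen3 series: temperature 0.6, top_p 0.95, min_p 0, top_k 20, length 37K.
QWEN3 = DecodingConfig(
    temperature=0.6, top_p=0.95, max_tokens=37 * K, min_p=0.0, top_k=20
)

# Defaults quoted for the two decoding methods (constant names are hypothetical).
SD_NUM_SPECULATIVE_TOKENS = 8      # speculative tokens for SD
LOOKAHEAD_NUM_SPECULATIVE_STEPS = 6  # speculative steps for Lookahead Reasoning

if __name__ == "__main__":
    print(DEEPSEEK_R1_DISTILL.to_kwargs())
    print(QWEN3.to_kwargs())
```

Keeping unspecified fields as `None` and filtering them out in `to_kwargs` avoids silently inventing a `top_k` or `min_p` value for the DeepSeek-R1-Distill runs, for which the quoted setup gives none.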