Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning

Authors: Rui Pan, Yinwei Dai, Zhihao Zhang, Gabriele Oliaro, Zhihao Jia, Ravi Netravali

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Across a variety of cross-domain reasoning benchmarks, SpecReason achieves 1.4-3.0× speedup over vanilla LRM inference while improving accuracy by 0.4-9.0%. Compared to speculative decoding without SpecReason, their combination yields an additional 8.8-58.0% latency reduction. We evaluate SpecReason across a wide range of reasoning workloads spanning tasks of varying complexity [aim, 2025, Hendrycks et al., 2021, Rein et al., 2024].
Researcher Affiliation	Academia	Princeton University; Carnegie Mellon University; EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode	No	The paper describes the system and its components but does not provide structured pseudocode or an algorithm block.
Open Source Code	Yes	We open-source SpecReason at https://github.com/ruipeterpan/specreason.
Open Datasets	Yes	We evaluate SpecReason on three diverse reasoning benchmarks: AIME [aim, 2025] for high-school competition-level mathematical problems, MATH500 [Hendrycks et al., 2021] for high-school competition-level mathematical problems sampled from AMC 10, AMC 12, and AIME, and GPQA Diamond [Rein et al., 2024] for graduate-level questions in general domains like biology, physics, and chemistry.
Dataset Splits	No	The paper mentions using AIME, MATH500, and GPQA Diamond datasets for evaluation, but it does not explicitly provide details on how these datasets were split into training, validation, or test sets.
Hardware Specification	Yes	Hardware. We run our evaluations on two NVIDIA A6000-48GB GPUs. We use vLLM 0.8.2 as the underlying inference engine and enable prefix caching [Kwon et al., 2023, Zheng et al., 2023, Pan et al., 2025]. Both models are served with a tensor parallelism degree of two. Given the size of the R1-70B model, we deploy it across four A100-80GB GPUs using a tensor parallelism degree of 4.
Software Dependencies	Yes	Hardware. We run our evaluations on two NVIDIA A6000-48GB GPUs. We use vLLM 0.8.2 as the underlying inference engine and enable prefix caching [Kwon et al., 2023, Zheng et al., 2023, Pan et al., 2025]. Both models are served with a tensor parallelism degree of two.
Experiment Setup	Yes	Similar to prior work [Guo et al., 2025], we set k=16 when calculating pass@1, i.e., we generate 16 responses with temperature=0.6 for every query and calculate the average accuracy and set the token budget to be 8192 tokens to ensure an apples-to-apples comparison between baselines. During the base model's evaluation of each reasoning step, we vary the acceptance threshold for the utility score between 3, 5, 7, and 9, and report the resulting accuracy and latency. In Fig. 6, we also study the effect of the alternative knob, forcing the first n reasoning steps to be decoded by the base model, on the accuracy-latency tradeoff. As we change n from 0 to 10, 20, 30, and 40, SpecReason's accuracy increases from 33.2% to 37.3% while the latency increases from 270.4s to 292.6s, showcasing an alternative approach to improve accuracy with a slight increase in latency.
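
The pass@1 metric quoted in the Experiment Setup row (k sampled responses per query, averaged) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name pass_at_1 and the toy correctness flags are hypothetical, and the sketch assumes each sampled response has already been graded as correct or incorrect.

```python
from statistics import mean


def pass_at_1(correct_flags_per_query):
    """Estimate pass@1 as the mean accuracy over k samples per query.

    correct_flags_per_query: one list per query, holding k booleans
    (True if that sampled response was graded correct).
    """
    # Accuracy of each query = fraction of its k samples that are correct.
    per_query = [mean([1.0 if ok else 0.0 for ok in flags])
                 for flags in correct_flags_per_query]
    # Final score = average of the per-query accuracies.
    return mean(per_query)


# Hypothetical toy data: 2 queries with k=4 samples each
# (the excerpt above uses k=16 with temperature=0.6).
toy_flags = [
    [True, True, False, True],    # 3 of 4 correct
    [False, False, True, False],  # 1 of 4 correct
]
score = pass_at_1(toy_flags)
print(f"pass@1 estimate: {score:.2f}")
```

Averaging over several samples per query reduces the variance that a single sampled response at nonzero temperature would introduce into the reported accuracy.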