Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Controlling Thinking Speed in Reasoning Models

Authors: Zhengkai Lin, Zhihang Fu, Ze Chen, Chao Chen, Liang Xie, Wenxiao Wang, Deng Cai, Zheng Wang, Jieping Ye

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Without any training or additional cost, our plug-in module delivers an average +1.3% accuracy with -8.6% token usage across leading LRMs and advanced reasoning benchmarks. All of our algorithms are implemented based on vLLM and are expected to support broader applications and inspire future research. Experimental settings: We experiment with 2 widely-used LRMs, namely DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Qwen-32B [7]. We evaluate these models on AIME24 [21], MATH-500 [18], GPQA Diamond [27] and LiveCodeBench (release_v2) [12]. These benchmarks cover various reasoning skills across multiple disciplines, including math, biology, physics, chemistry, and coding.
Researcher Affiliation Collaboration Zhengkai Lin¹, Zhihang Fu², Ze Chen², Chao Chen², Liang Xie³, Wenxiao Wang⁴, Deng Cai¹, Zheng Wang², Jieping Ye²; ¹State Key Lab of CAD&CG, Zhejiang University; ²Alibaba Cloud; ³College of Computer Science and Technology, Zhejiang University of Technology; ⁴School of Software Technology, Zhejiang University
Pseudocode Yes A.1 Details on sliding-window based adaptive control algorithm. Algorithm 1: Sliding-window based Adaptive Control Algorithm
Open Source Code Yes Code available at: https://github.com/D2I-ai/thinking-speed-control
Open Datasets Yes We evaluate these models on AIME24 [21], MATH-500 [18], GPQA Diamond [27] and LiveCodeBench (release_v2) [12]. For representation reading, we use only the MATH training set (7.5k math problems) to sample both fast and slow responses from the LRMs. To determine the optimal bucket for math reasoning, we evaluate model performance across different buckets using the AMC problem set [17] as validation data.
Dataset Splits Yes For representation reading, we use only the MATH training set (7.5k math problems) to sample both fast and slow responses from the LRMs. After filtering out incorrect responses sampled from the MATH training set, we used approximately 6k stimulus pairs for representation collection. We randomly chose 4k pairs to compute the difference vectors, with half of the pairs to compute directional vectors $d_i^{(-+)}$ and the other half to compute the reversed-direction vectors $d_j^{(+-)}$. The remaining 2k pairs from the MATH set served as a validation set.
Hardware Specification Yes We run all the experiments with NVIDIA A100 80GB GPUs.
Software Dependencies No All of our algorithms are implemented based on vLLM and are expected to support broader applications and inspire future research. For all experiments, we use vLLM [15] and set the maximum generation length to 32,768 tokens.
Experiment Setup Yes During sampling, we use a temperature of 0.6, a top-p value of 0.95, and generate 8 responses per query to calculate Pass@1 (i.e., accuracy). We sweep the steering intensity α in Equation (1) across positive and negative values to evaluate the LRMs' performance when their thought processes are accelerated or decelerated. We set the sliding window size as k = 8 tokens. The outlier detection threshold λ is set to 2.0 for the 7B and 8B models and 1.5 for the 32B model. The steering intensity α is constrained to the range [-4, 4].
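
The settings quoted in the Experiment Setup row can be illustrated with a minimal sketch. It is not taken from the authors' repository. It assumes Pass@1 is estimated as the fraction of correct responses among the 8 sampled per query, averaged over queries, and that a steering intensity outside the stated range is clipped to it. The names SAMPLING, ALPHA_RANGE, pass_at_1, and clamp_alpha are hypothetical and chosen only for illustration.

```python
from statistics import mean

# Values quoted in the table above; the dictionary layout itself is an assumption.
SAMPLING = {"temperature": 0.6, "top_p": 0.95, "n": 8, "max_tokens": 32768}
ALPHA_RANGE = (-4.0, 4.0)  # stated range for the steering intensity alpha


def pass_at_1(correct_flags):
    """Estimate Pass@1 from sampled responses.

    correct_flags holds one list per query; each inner list has one boolean
    per sampled response (8 per query in the quoted setup).
    """
    # Per-query accuracy is the share of correct samples for that query.
    per_query = [sum(flags) / len(flags) for flags in correct_flags]
    # The benchmark score is the mean of the per-query accuracies.
    return mean(per_query)


def clamp_alpha(alpha, bounds=ALPHA_RANGE):
    """Clip a proposed steering intensity to the allowed range (assumed behavior)."""
    low, high = bounds
    return max(low, min(high, alpha))


if __name__ == "__main__":
    # Two toy queries: all 8 samples correct, then 4 of 8 correct.
    flags = [[True] * 8, [True] * 4 + [False] * 4]
    print(pass_at_1(flags))
    print(clamp_alpha(5.2), clamp_alpha(-7.0), clamp_alpha(1.5))
```

Averaging over several samples per query reduces the variance that a temperature of 0.6 introduces into single-sample accuracy, which is presumably why 8 responses are drawn for each query.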