Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Mitigating Overthinking in Large Reasoning Models via Manifold Steering

Authors: Yao Huang, Huanran Chen, Shouwei Ruan, Yichi Zhang, Xingxing Wei, Yinpeng Dong

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on DeepSeek-R1 distilled models validate that our method reduces output tokens by up to 71% while maintaining and even improving the accuracy on several mathematical benchmarks. Our method also exhibits robust cross-domain transferability, delivering consistent token reduction performance in code generation and knowledge-based QA tasks.
Researcher Affiliation	Academia	(1) Institute of Artificial Intelligence, Beihang University, Beijing 100191, China; (2) College of AI, Tsinghua University, Beijing 100084, China; (3) Shanghai Qi Zhi Institute; (4) State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Pseudocode	No	The paper describes methods through textual descriptions and mathematical equations like Eq. (1), (2), (3), (5), (6), (7), (8), (9), and (10), but does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code	Yes	Code is available at: https://github.com/Aries-iai/Manifold_Steering.
Open Datasets	Yes	Extensive experiments on multiple DeepSeek-R1 distilled models [14] of different sizes verify the effectiveness of our manifold steering method. We first test it on mathematical datasets of varying difficulty, including GSM8K [10], MATH500 [20], AMC2023 [24], and AIME2024 [25]. To further verify the transferability, we use LiveCodeBench [18] for code generation and GPQA-Diamond [32] for expert-level disciplinary knowledge.
Dataset Splits	Yes	Evaluation Datasets & Metrics. To evaluate the effectiveness of Manifold Steering, we include mathematical datasets of varying difficulty: GSM8K [10], MATH500 [20], AMC2023 [24], and AIME2024 [25]. To further verify the transferability, we use LiveCodeBench [18] for code generation and GPQA-Diamond [32] for expert-level disciplinary knowledge. All datasets are evaluated using Pass@1 as the task-solving metric and the average token count (#Tokens) for overthinking mitigation. Implementation Details. The data for computing steering directions is filtered using the method outlined in Sec. 3 on the OpenMathInstruct2 dataset [36].
Hardware Specification	Yes	All experiments are conducted on an Ubuntu 22.04 system with A800 GPUs.
Software Dependencies	No	The paper mentions 'Ubuntu 22.04' as the operating system, but does not specify programming language versions or library versions (e.g., Python, PyTorch/TensorFlow, CUDA versions) used for the implementation.
Experiment Setup	Yes	All models use recommended settings: temperature of 0.6, top-p of 0.95, and a maximum token limit of 16,384. We specify the layer used to compute the steering direction and the intervention strength α as follows: R1-1.5B (layer 27, α = 0.7), R1-7B (layer 27, α = 0.3), R1-8B (layer 31, α = 0.5), and R1-14B (layer 47, α = 0.3). During inference, this direction is applied to all layers as stated in Eq. (3).
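
The settings quoted in the Experiment Setup row, together with the two metrics named in the Dataset Splits row, can be collected in a small sketch. This is not the authors' code: the names SAMPLING, SteeringConfig, STEERING, summarize, and token_reduction are hypothetical, the model keys are the short labels used in the table rather than official model identifiers, and the table does not say whether layer indices are 0-based or 1-based. Pass@1 is computed here under the assumption of a single sample per problem; the paper may average over several samples.

```python
from dataclasses import dataclass

# Decoding settings as quoted in the Experiment Setup row.
SAMPLING = {"temperature": 0.6, "top_p": 0.95, "max_tokens": 16384}


@dataclass(frozen=True)
class SteeringConfig:
    layer: int    # layer used to compute the steering direction (indexing base not stated)
    alpha: float  # intervention strength


# Keys are the table's short labels, not official model identifiers (assumption).
STEERING = {
    "R1-1.5B": SteeringConfig(layer=27, alpha=0.7),
    "R1-7B": SteeringConfig(layer=27, alpha=0.3),
    "R1-8B": SteeringConfig(layer=31, alpha=0.5),
    "R1-14B": SteeringConfig(layer=47, alpha=0.3),
}


def summarize(results):
    """Return (pass_at_1, mean_tokens) for a list of (is_correct, n_tokens) pairs.

    Assumes exactly one sampled answer per problem, so Pass@1 is the
    fraction of problems answered correctly.
    """
    if not results:
        raise ValueError("results must not be empty")
    n = len(results)
    pass_at_1 = sum(1 for ok, _ in results if ok) / n
    mean_tokens = sum(tokens for _, tokens in results) / n
    return pass_at_1, mean_tokens


def token_reduction(baseline_tokens, steered_tokens):
    """Fractional reduction in average token count relative to a baseline."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 1.0 - steered_tokens / baseline_tokens
```

A run would pass per-problem outcomes to summarize for the baseline and the steered model separately, then compare the two mean token counts with token_reduction; a value near 0.71 would correspond to the 71% figure quoted in the Research Type row.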