Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Let LRMs Break Free from Overthinking via Self-Braking Tuning

Authors: Haoran Zhao, Yuchen Yan, Yongliang Shen, Haolei Xu, Wenqi Zhang, Kaitao Song, Jian Shao, Weiming Lu, Jun Xiao, Yueting Zhuang

NeurIPS 2025 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We conduct extensive experiments to evaluate the effectiveness of Self-Braking Tuning across various model architectures and mathematical reasoning tasks. Our evaluation aims to answer three key questions: (1) How effectively does SBT reduce token consumption while preserving accuracy? (2) How does performance vary across different model sizes and architectures? (3) How do the two SBT variants (SBT-E and SBT-D) compare in practice? |
| Researcher Affiliation | Collaboration | 1 Zhejiang University, 2 Tianjin University, 3 Microsoft Research Asia |
| Pseudocode | Yes | Algorithm 1 Self-Braking Tuning Exact (SBT-E) ... Algorithm 2 Self-Braking Tuning Dynamic (SBT-D) |
| Open Source Code | Yes | GitHub: https://github.com/ZJU-REAL/Self-Braking-Tuning Project: https://zju-real.github.io/SBT/ |
| Open Datasets | Yes | We curate a dataset of 92K high-quality instances from OpenR1-Math [1]... [1] Open R1 Team. OpenR1-Math-220k Dataset. https://huggingface.co/datasets/open-r1/OpenR1-Math-220k, 2025. |
| Dataset Splits | Yes | Evaluation benchmarks. We evaluate performance across four mathematical reasoning benchmarks of varying difficulty: AIME (24&25, competition-level algebraic problems), AMC23 (precollegiate mathematics), MATH500 [44, 45] (diverse mathematical problems), and GSM8K [47] (grade school math word problems). |
| Hardware Specification | Yes | Training is conducted on 64 Ascend H910B-64G hardware. ... All inference are performed on NVIDIA A100 GPUs. |
| Software Dependencies | No | All models are trained using Megatron-LM for 3 epochs with a 1e-5 initial learning rate, cosine decay schedule, 0.03 warm-up ratio, and 16,384-token maximum sequence length. |
| Experiment Setup | Yes | All models are trained using Megatron-LM for 3 epochs with a 1e-5 initial learning rate, cosine decay schedule, 0.03 warm-up ratio, and 16,384-token maximum sequence length. Training is conducted on 64 Ascend H910B-64G hardware. ... For inference, we use vLLM [48] with temperature 0.7, generating 8 samples per question and reporting Average accuracy. All inference are performed on NVIDIA A100 GPUs. |
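
As a reading aid for the Experiment Setup entry, the following is a minimal Python sketch of what a cosine decay schedule with a 0.03 warm-up ratio and an average-over-samples accuracy could look like. It is not taken from the paper or from Megatron-LM. The function names `lr_at` and `average_accuracy`, the linear warm-up shape, and the decay floor `MIN_LR = 0.0` are assumptions made for illustration; only the peak learning rate, the warm-up ratio, and the 8 samples per question come from the reported setup.

```python
import math

PEAK_LR = 1e-5        # reported initial learning rate
WARMUP_RATIO = 0.03   # reported warm-up ratio
MIN_LR = 0.0          # assumption: the decay floor is not reported


def lr_at(step: int, total_steps: int) -> float:
    """Learning rate at a 0-indexed optimizer step (hypothetical helper)."""
    warmup_steps = max(1, round(WARMUP_RATIO * total_steps))
    if step < warmup_steps:
        # Assumed linear warm-up that reaches PEAK_LR on the last warm-up step.
        return PEAK_LR * (step + 1) / warmup_steps
    # Cosine decay from PEAK_LR down to MIN_LR over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


def average_accuracy(correct: list[list[bool]]) -> float:
    """Mean of per-question accuracies; correct[q][k] is sample k of question q."""
    per_question = [sum(samples) / len(samples) for samples in correct]
    return sum(per_question) / len(per_question)
```

With 8 samples per question, `average_accuracy` takes one list of 8 booleans per benchmark question, so a question answered correctly in 4 of 8 samples contributes 0.5 to the mean.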