Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Let LRMs Break Free from Overthinking via Self-Braking Tuning

Authors: Haoran Zhao, Yuchen Yan, Yongliang Shen, Haolei Xu, Wenqi Zhang, Kaitao Song, Jian Shao, Weiming Lu, Jun Xiao, Yueting Zhuang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct extensive experiments to evaluate the effectiveness of Self-Braking Tuning across various model architectures and mathematical reasoning tasks. Our evaluation aims to answer three key questions: (1) How effectively does SBT reduce token consumption while preserving accuracy? (2) How does performance vary across different model sizes and architectures? (3) How do the two SBT variants (SBT-E and SBT-D) compare in practice?
Researcher Affiliation	Collaboration	1 Zhejiang University, 2 Tianjin University, 3 Microsoft Research Asia
Pseudocode	Yes	Algorithm 1 Self-Braking Tuning Exact (SBT-E) ... Algorithm 2 Self-Braking Tuning Dynamic (SBT-D)
Open Source Code	Yes	GitHub: https://github.com/ZJU-REAL/Self-Braking-Tuning Project: https://zju-real.github.io/SBT/
Open Datasets	Yes	We curate a dataset of 92K high-quality instances from OpenR1-Math [1]... [1] Open R1 Team. OpenR1-Math-220k Dataset. https://huggingface.co/datasets/open-r1/OpenR1-Math-220k, 2025.
Dataset Splits	Yes	Evaluation benchmarks. We evaluate performance across four mathematical reasoning benchmarks of varying difficulty: AIME (24&25, competition-level algebraic problems), AMC23 (precollegiate mathematics), MATH500 [44, 45] (diverse mathematical problems), and GSM8K [47] (grade school math word problems).
Hardware Specification	Yes	Training is conducted on 64 Ascend H910B-64G hardware. ... All inference are performed on NVIDIA A100 GPUs.
Software Dependencies	No	All models are trained using Megatron-LM for 3 epochs with a 1e-5 initial learning rate, cosine decay schedule, 0.03 warm-up ratio, and 16,384-token maximum sequence length.
Experiment Setup	Yes	All models are trained using Megatron-LM for 3 epochs with a 1e-5 initial learning rate, cosine decay schedule, 0.03 warm-up ratio, and 16,384-token maximum sequence length. Training is conducted on 64 Ascend H910B-64G hardware. ... For inference, we use vLLM [48] with temperature 0.7, generating 8 samples per question and reporting Average accuracy. All inference are performed on NVIDIA A100 GPUs.
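
The Experiment Setup row reports "Average accuracy" over 8 sampled generations per question. The quoted text does not spell out how the average is taken, so the following is only a minimal sketch under one common reading: compute the fraction of correct samples for each question, then average those fractions across questions. The function name and the correctness flags are hypothetical and are not taken from the paper or its repository.

```python
from statistics import mean


def average_accuracy(correct_flags_per_question):
    """Mean over questions of the per-question fraction of correct samples.

    Assumption: each inner list holds one boolean per sampled answer
    (8 per question in the setup quoted above).
    """
    per_question = [
        mean([1.0 if ok else 0.0 for ok in flags])
        for flags in correct_flags_per_question
    ]
    return mean(per_question)


# Hypothetical flags for two questions with 8 samples each.
flags = [
    [True, True, False, True, True, True, False, True],        # 6 of 8 correct
    [False, False, True, False, False, False, False, False],   # 1 of 8 correct
]
print(average_accuracy(flags))  # 0.4375
```

When every question has the same number of samples, this gives the same value as pooling all samples and taking a single mean; the two readings differ only if sample counts vary per question.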