Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Thinkless: LLM Learns When to Think

Authors: Gongfan Fang, Xinyin Ma, Xinchao Wang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, on several benchmarks such as Minerva Algebra, MATH-500, and GSM8K, Thinkless is able to reduce the usage of long-chain thinking by 50% - 90%, significantly improving the efficiency of Reasoning Language Models.
Researcher Affiliation | Academia | Gongfan Fang Xinyin Ma Xinchao Wang National University of Singapore EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes the Decoupled Group Relative Policy Optimization (DeGRPO) algorithm by detailing its objective function and components, but it does not include a formal pseudocode block or algorithm listing.
Open Source Code | Yes | The code is available at https://github.com/VainF/Thinkless
Open Datasets | Yes | For evaluation, we mainly focus on math datasets, including AIME [37], Minerva Algebra [15], MATH-500 [22] and GSM-8K [8]. For the reinforcement learning stage, we primarily use the DeepScaleR dataset [25], which comprises approximately 40K labeled examples.
Dataset Splits | No | The paper uses the OpenThoughts2-1M dataset for the warm-up stage, the DeepScaleR dataset for reinforcement learning, and AIME, Minerva Algebra, MATH-500, and GSM8K for evaluation. However, it does not provide explicit training/test/validation splits (e.g., percentages or specific sample counts for splits) for any of these datasets.
Hardware Specification | Yes | All experiments were conducted on a single node with 4 H100 GPUs.
Software Dependencies | No | The SFT was conducted on the Megatron framework [32]. The RL experiments were implemented using the VeRL framework [31].
Experiment Setup | Yes | For the warm-up stage, we set the maximum context length to 16K... The model is trained for a single full epoch... The reinforcement learning stage... The model was trained only for 600 steps, using the AdamW optimizer with a learning rate of 1 × 10⁻⁶, β = (0.9, 0.999), and a weight decay of 0.01. The batch size is set to 128, with 8 responses sampled for each query...
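
Note: The reinforcement learning values quoted in the Experiment Setup row can be collected into a small configuration sketch, which makes the implied sampling volume easy to check. The class name `RLStageConfig` and its field names are hypothetical and are not taken from the paper or its repository. The sketch also assumes that the batch size of 128 counts queries, so that each step samples 128 × 8 responses.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RLStageConfig:
    """Hypothetical container for the RL-stage values quoted in the table."""

    train_steps: int = 600
    optimizer: str = "AdamW"
    learning_rate: float = 1e-6
    betas: Tuple[float, float] = (0.9, 0.999)
    weight_decay: float = 0.01
    batch_size: int = 128         # assumption: counts queries per step
    responses_per_query: int = 8

    def responses_per_step(self) -> int:
        # Responses sampled in one step under the assumption above.
        return self.batch_size * self.responses_per_query

    def total_responses(self) -> int:
        # Responses sampled over the whole 600-step run.
        return self.train_steps * self.responses_per_step()


cfg = RLStageConfig()
print(cfg.responses_per_step())  # 1024
print(cfg.total_responses())     # 614400
```

Under these assumptions, the run samples 1,024 responses per step and 614,400 responses in total. The elided parts of the quoted setup (marked "...") are not filled in here.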