Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Training Language Models to Reason Efficiently

Authors: Daman Arora, Andrea Zanette

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on two open-weight large reasoning models demonstrate significant reductions in inference cost while preserving most of the accuracy. In this work, we propose to train large reasoning models to reason efficiently. We perform numerical experiments on two recently released open-weight large reasoning models, DeepSeek-R1-Distill-Qwen-1.5B and DeepSeek-R1-Distill-Qwen-7B (Guo et al., 2025), and derive models with a substantial reduction in reasoning cost while approximately maintaining accuracy; see Figure 1 for a summary of our results. We observe that our training procedure allows us to gracefully navigate the compute-performance tradeoff curve, that is, reduce compute significantly with minimal loss in performance.
Researcher Affiliation | Academia | Daman Arora, Carnegie Mellon University, EMAIL; Andrea Zanette, Carnegie Mellon University, EMAIL
Pseudocode | No | The paper describes the methodology using prose, mathematical equations, and diagrams (e.g., Figure 2: 'Pipeline depicting our method'). However, it does not include a clearly labeled 'Pseudocode', 'Algorithm', or structured block of steps formatted like code or an algorithm.
Open Source Code | Yes | All of our code and trained models are public at https://github.com/Zanette-Labs/efficient-reasoning.
Open Datasets | Yes | For post-training the model using our technique, we choose 3.2k prompts from the MATH, cn_k12, AIME, AoPS and the Olympiad subsets of the NuminaMath dataset (LI et al., 2024). We report the training logs and also evaluate the models on three test datasets, namely: GSM8K (Cobbe et al., 2021a), which contains grade-school-level math problems; MATH500 (Hendrycks et al., 2021), which is a standard benchmark containing harder problems than GSM8K; and The American Invitational Mathematics Examination (AIME) 2024, a competition-level dataset of challenging mathematical problems. Additionally, to verify the robustness of our training methodology to datasets other than those based on mathematics, we evaluate models on CommonSenseQA and Logical Deduction from BIG-Bench (Srivastava et al., 2023).
Dataset Splits | No | The paper mentions using '3.2k prompts from the MATH, cn_k12, AIME, AoPS and the Olympiad subsets of the NuminaMath dataset' for post-training, and 'three test datasets namely: GSM8K, MATH500, and The American Invitational Mathematics Examination (AIME) 2024' for evaluation. While it specifies sample sizes for evaluation (e.g., 'For GSM8K, we set k = 1', 'for MATH500, we use k = 3', 'for AIME2024, we set k = 10'), it does not provide explicit training/validation/test splits (e.g., 80/10/10 percentages or specific sample counts for each split) from a common dataset, nor does it detail how the '3.2k prompts' were themselves split into training and validation sets.
Hardware Specification | Yes | For the 1.5B model, we use 4 GH200 GPUs on one low-density node and for the 7B model, we use 8 GH200 GPUs distributed across two low-density nodes (4 GPUs per node). This work used GH200 GPUs at DeltaAI through allocation CIS250018 and CIS250527 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by U.S. National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.
Software Dependencies | No | We build on the OpenRLHF codebase (Hu et al., 2024). We use vLLM (Kwon et al., 2023b) for efficient batch inference. Adam (Kingma & Ba, 2017) is used as the standard optimizer. While these software tools and optimizers are mentioned, specific version numbers are not provided for any of them.
Experiment Setup | Yes | For all models, we set the temperature to 0.6 as suggested in the model's card and set the token limit to 32K. For training the 1.5B, ZeRO Stage 2 (Rajbhandari et al., 2020) is used and for the 7B, ZeRO Stage 3 with activation checkpointing is required to prevent out-of-memory errors. The training precision is set to bfloat16. We generate 8 responses for each prompt. For every iteration, 32 prompts are selected from the dataset and the global batch size is set to 128, which leads to 2 gradient steps per RL iteration. For the 1.5B, the learning rate is set to $5 \times 10^{-6}$ and for the 7B, it is set to $2 \times 10^{-6}$. For all experiments, Adam (Kingma & Ba, 2017) is used as the standard optimizer. We experiment with 4 values of α in the following range: 0.05, 0.1, 0.2 and 0.4. For all RL experiments, the value of the KL coefficient is set to $1 \times 10^{-3}$. The experiments on both models take approximately 20 hours.
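The "2 gradient steps per RL iteration" figure in the Experiment Setup row follows from the reported batch sizes. The sketch below reproduces that arithmetic. The class and field names are hypothetical and do not come from the paper or the OpenRLHF codebase, and it assumes that each generated response counts as one training sample and is used exactly once per iteration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RLSetup:
    """Hypothetical container for the reported batch settings (names are illustrative)."""

    prompts_per_iteration: int = 32   # prompts selected per RL iteration (from the source)
    responses_per_prompt: int = 8     # responses generated per prompt (from the source)
    global_batch_size: int = 128      # samples per gradient step (from the source)

    @property
    def samples_per_iteration(self) -> int:
        # Assumption: every generated response is one training sample.
        return self.prompts_per_iteration * self.responses_per_prompt

    @property
    def gradient_steps_per_iteration(self) -> int:
        # Assumption: each sample is consumed exactly once per iteration.
        samples = self.samples_per_iteration
        if samples % self.global_batch_size != 0:
            raise ValueError("samples per iteration must be a multiple of the global batch size")
        return samples // self.global_batch_size


setup = RLSetup()
print(setup.samples_per_iteration)         # 256
print(setup.gradient_steps_per_iteration)  # 2
```

Under these assumptions, 32 prompts × 8 responses gives 256 samples, and 256 / 128 gives the 2 gradient steps stated in the row.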