Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Teaching Language Models to Reason with Tools

Authors: Chengpeng Li, Zhengyang Tang, Ziniu Li, Mingfeng Xue, Keqin Bao, Tian Ding, Ruoyu Sun, Benyou Wang, Xiang Wang, Junyang Lin, Dayiheng Liu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our experimental evaluations demonstrate CoRT's effectiveness, yielding absolute improvements of 4% and 8% on DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Qwen-1.5B, respectively, across five challenging mathematical reasoning datasets.
Researcher Affiliation	Collaboration	Chengpeng Li¹,², Zhengyang Tang²,³, Ziniu Li³,⁴, Mingfeng Xue², Keqin Bao¹,², Tian Ding⁴, Ruoyu Sun³,⁴, Benyou Wang³, Xiang Wang¹, Junyang Lin², and Dayiheng Liu². ¹University of Science and Technology of China; ²Qwen Team, Alibaba Inc.; ³The Chinese University of Hong Kong, Shenzhen; ⁴Shenzhen International Center for Industrial and Applied Mathematics, Shenzhen Research Institute of Big Data
Pseudocode	Yes	The Iterative Hint-Engineering Loop: Hint-engineering implements an iterative refinement procedure that converts imperfect reasoning trajectories into expert-aligned ones. For a problem instance P, let τ^(i) denote the trajectory produced at iteration i. The loop operates as follows: 1. Initial generation (i = 0). The reasoner produces an initial trajectory τ^(0) for P. 2. Annotation and evaluation. A human annotator reviews τ^(i). If no deviation from the desired reasoning is detected, the procedure terminates with the final trajectory τ = τ^(i). Otherwise, the annotator localizes the erroneous step t and its associated action a_t, and formulates a corrective hint h_i. 3. Localized revision and resumption. The context at step t is augmented with h_i, yielding an updated state. From this state, the reasoner resumes its computation and produces the refined trajectory τ^(i+1).
Open Source Code	Yes	The models and code are available at: https://github.com/ChengpengLi1003/CoRT.
Open Datasets	Yes	NuminaMath. [https://huggingface.co/AI-MO/NuminaMath-1.5](https://github.com/project-numina/aimo-progress-prize/blob/main/report/numina_dataset.pdf), 2024.
Dataset Splits	Yes	Prompt-Hint-SFT-32B: Starting from the DeepSeek-R1-32B base model, we fine-tuned using 800 data samples with a learning rate of 1 × 10⁻⁵, running for 17 epochs with a batch size of 96. Hint-Engineering-SFT-32B: Based on DeepSeek-R1-32B, we fine-tuned using only 30 high-quality, human-annotated data samples with a batch size of 96, learning rate of 1 × 10⁻⁵, and 40 epochs. Hint-Engineering-RFT-32B: Building upon Hint-Engineering-SFT-32B, we further finetuned using 800 filtered data samples with a learning rate of 1 × 10⁻⁵, 17 epochs, and batch size of 96. [...] We distilled both Prompt-Hint-SFT-32B and Hint-Engineering-RFT-32B down to the DeepSeek-R1-1.5B architecture using 10k data samples [...] The RL training data was carefully selected by computing the average accuracy over 8 samples (avg@8) on 20k randomly selected problems from the NuminaMath-1.5 [63] dataset, then selecting only 1k challenging problems where avg@8 = 1/8 for focused training.
Hardware Specification	Yes	Our experiments utilized the following hardware: Training: All training procedures, including supervised fine-tuning (SFT), rejection finetuning (RFT), and reinforcement learning (RL), were conducted on 4 servers, each equipped with 8 NVIDIA A100 GPUs. Evaluation: All model evaluations were performed on single servers, each equipped with 8 NVIDIA A100 GPUs, ensuring consistent measurement conditions across all compared approaches.
Software Dependencies	No	The paper mentions software components like Python code, SymPy, NumPy, SciPy, Matplotlib, Seaborn, pandas, math, statistics, fractions, PuLP, and a Jupyter-like environment, but it does not specify any version numbers for these software dependencies.
Experiment Setup	Yes	For our experiments, we implemented several model variants with different training stages and architectures: 32B Models: Prompt-Hint-SFT-32B: Starting from the DeepSeek-R1-32B base model, we fine-tuned using 800 data samples with a learning rate of 1 × 10⁻⁵, running for 17 epochs with a batch size of 96. Hint-Engineering-SFT-32B: Based on DeepSeek-R1-32B, we fine-tuned using only 30 high-quality, human-annotated data samples with a batch size of 96, learning rate of 1 × 10⁻⁵, and 40 epochs. Hint-Engineering-RFT-32B: Building upon Hint-Engineering-SFT-32B, we further finetuned using 800 filtered data samples with a learning rate of 1 × 10⁻⁵, 17 epochs, and batch size of 96. 1.5B Models: We distilled both Prompt-Hint-SFT-32B and Hint-Engineering-RFT-32B down to the DeepSeek-R1-1.5B architecture using 10k data samples with a learning rate of 7 × 10⁻⁶, 6 epochs, and batch size of 128. For reinforcement learning, we adapted the veRL framework [62] to implement our specialized design outlined in Section 2.4. We further trained these 1.5B models with a learning rate of 1 × 10⁻⁶, maximum response length of 16,000 tokens, 8 rollouts per problem, and maximum function calls limited to 15 per response, with each function call having a maximum length of 16,000 tokens. [...] Inference Setting: Across all evaluations, we standardized inference parameters with maximum sequence length of 32,768 tokens, maximum function calls limited to 15, maximum tokens per function call set to 32,768, temperature of 0.6, and top-p sampling parameter of 0.95.
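
Illustrative sketch (not from the paper): the iterative hint-engineering loop quoted in the Pseudocode row can be read as the control flow below. Everything named here is hypothetical: `reasoner` and `annotator` are stand-in callables, a trajectory is modeled as a plain list of step strings, and `max_iterations` is a safety cap that the quoted text does not mention. The sketch also assumes that "augmenting the context at step t" means keeping the steps before t, inserting the hint, and letting the reasoner regenerate the rest; the paper may handle this differently.

```python
from typing import Callable, List, Optional, Tuple

Trajectory = List[str]  # assumption: a trajectory is an ordered list of steps


def hint_engineering_loop(
    problem: str,
    reasoner: Callable[[str, Trajectory], Trajectory],
    annotator: Callable[[Trajectory], Optional[Tuple[int, str]]],
    max_iterations: int = 10,  # assumption: cap not stated in the source
) -> Trajectory:
    """Refine a trajectory until the annotator reports no deviation."""
    # Step 1: initial generation from an empty context.
    trajectory = reasoner(problem, [])
    for _ in range(max_iterations):
        # Step 2: the annotator returns None, or (erroneous step t, hint h_i).
        feedback = annotator(trajectory)
        if feedback is None:
            return trajectory
        step, hint = feedback
        # Step 3: keep steps before t, add the hint, resume generation.
        prefix = trajectory[:step] + [hint]
        trajectory = reasoner(problem, prefix)
    return trajectory


# Toy stand-ins, used only to exercise the control flow.
def toy_reasoner(problem: str, prefix: Trajectory) -> Trajectory:
    if any(s.startswith("HINT") for s in prefix):
        return prefix + ["call python", "answer"]
    return prefix + ["think", "manual arithmetic", "answer"]


def toy_annotator(trajectory: Trajectory) -> Optional[Tuple[int, str]]:
    for t, s in enumerate(trajectory):
        if s == "manual arithmetic":
            return t, "HINT: use the code tool"
    return None


result = hint_engineering_loop("toy problem", toy_reasoner, toy_annotator)
print(result)
```

In this toy run the annotator flags the manual-arithmetic step once, and the second trajectory passes review, so the loop stops after a single revision.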