Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning

Authors: Wenkai Yang, Shuming Ma, Yankai Lin, Furu Wei

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our explorations on mathematical reasoning tasks reveal an unexpected finding that scaling with longer CoTs can indeed impair the reasoning performance of LLMs in certain domains. Moreover, we discover that there exists an optimal scaled length distribution that differs across different domains. Based on these insights, we propose a Thinking-Optimal Scaling strategy. Our method first uses a small set of seed data with varying response length distributions to teach the model to adopt different reasoning efforts for deep thinking. Then, the model selects its shortest correct response under different reasoning efforts on additional problems for self-improvement. Our self-improved models built upon Qwen2.5-32B-Instruct outperform other distillation-based 32B o1-like models across various math benchmarks, and achieve performance on par with the teacher model QwQ-32B-Preview that produces the seed data.
Researcher Affiliation	Collaboration	(1) Gaoling School of Artificial Intelligence, Renmin University of China; (2) Microsoft Research
Pseudocode	No	The paper includes illustrations and flowcharts (e.g., Figure 6) but no explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured, code-like steps.
Open Source Code	Yes	Code, data and models are available at https://github.com/RUCBM/TOPS.
Open Datasets	Yes	We calculate the accuracy and the average number of generated tokens of each model on two typical benchmarks: MATH500 [20]: 500 high school math competition problems across various subjects, sampled from MATH benchmark [13]; AIME2024: 30 challenging problems from the American Invitational Mathematics Examination (AIME). To address the issue of token counts not being directly comparable due to the different tokenizers used by different models, we standardize by using Qwen2.5 tokenizer to tokenize the reasoning completions of different models and then calculate the number of tokens. As the internal CoT of o1-mini is not available to users, we use an estimation strategy based on the summary part, the number of reasoning tokens and total number of completion tokens returned from the o1-mini model to estimate the number of tokens of hidden CoT tokenized by Qwen2.5 tokenizer. Details and further discussions are in Appendix B. We set the maximum number of generation tokens to 16,384 for each model in all evaluations. ... We further include GSM8K [4] that contains 1319 grade school math word problems. ... We first use three system prompts (refer to Figure 7), corresponding to different levels of reasoning effort (Low, Medium and High), to prompt QwQ-32B-Preview to generate solutions of different numbers of tokens for the same set of math problems sampled from NuminaMath [17].
Dataset Splits	Yes	We then evaluate the performance of our model on three typical math reasoning benchmarks: GSM8K [4], MATH500 and AIME2024. ... Finally, we incorporate the responses corresponding to low reasoning effort from the seed data into the above generated dataset, resulting in a thinking-optimal dataset of about 26K samples for self-improvement. We denote the self-improved model created by our method as Qwen2.5-32B-TOPS. We then evaluate the performance of our model on three typical math reasoning benchmarks: GSM8K, MATH500, and AIME2024.
Hardware Specification	Yes	The training is performed on 8 NVIDIA H100 80G. ... The training is performed on 4 NVIDIA H100 80G. ... The training is performed on 8 NVIDIA H100 80G. ... All evaluations are conducted on 4 NVIDIA A100 80G.
Software Dependencies	No	The paper mentions using 'Qwen2.5 tokenizer' and 'LLaMA3.1 tokenizer' but does not specify version numbers for any software libraries, frameworks, or programming languages used in the experiments.
Experiment Setup	Yes	In the SFT stage, the learning rate is 1e-5, the batch size is 96, and the number of epochs is 2. In inference, the decoding temperature is 1.0, the maximum generation length is 16,384. We report the average accuracy across 5 random seeds in each experiment. ... When creating two tag models, the learning rate is 1e-5, the batch size is 32. The number of epochs is 3 for Qwen2.5-32B-Tag and 5 for LLaMA3.1-8B-Tag. ... In the self-improvement stage, we perform SFT on Qwen2.5-32B-Instruct on the curated thinking-optimal dataset for 2 epochs. The learning rate is 1e-5, and the batch size is 96. ... In the iterative self-improvement stage, for Qwen2.5-32B-TOPS-Iter-SFT, the learning rate is 1e-6, the batch size is 32, and we set the training epoch to 1. For Qwen2.5-32B-TOPS-Iter-DPO, the learning rate is 5e-7, the batch size is 32, the training epoch is 3.
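
The Research Type entry above says the model "selects its shortest correct response under different reasoning efforts" to build its self-improvement data. The sketch below illustrates only that selection step and is not the authors' implementation (their code is at the repository linked under Open Source Code). The `Response` class, its field names, and both function names are hypothetical. It also assumes that token counts come from one shared tokenizer, and that problems with no correct response are dropped, which the quoted text does not state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Response:
    """One sampled solution (hypothetical container, not from the paper's code)."""
    effort: str       # reasoning-effort level: "low", "medium", or "high"
    text: str         # the generated solution
    num_tokens: int   # length, assumed to be measured with one shared tokenizer
    is_correct: bool  # whether the final answer matches the reference answer


def shortest_correct(responses: list[Response]) -> Optional[Response]:
    """Return the correct response with the fewest tokens, or None if none is correct."""
    correct = [r for r in responses if r.is_correct]
    if not correct:
        return None
    return min(correct, key=lambda r: r.num_tokens)


def build_dataset(problems: dict[str, list[Response]]) -> dict[str, Response]:
    """Keep one shortest correct response per problem.

    Assumption: problems without any correct response are skipped.
    """
    selected = {}
    for problem_id, responses in problems.items():
        best = shortest_correct(responses)
        if best is not None:
            selected[problem_id] = best
    return selected


if __name__ == "__main__":
    samples = {
        "p1": [
            Response("low", "short but wrong", 120, False),
            Response("medium", "correct", 340, True),
            Response("high", "correct, longer", 910, True),
        ],
        "p2": [Response("low", "wrong", 80, False)],
    }
    for pid, resp in build_dataset(samples).items():
        print(pid, resp.effort, resp.num_tokens)
```

In this toy input, the medium-effort response is kept for `p1` because it is the shortest one that is correct, and `p2` is left out because none of its responses is correct.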