Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

VeriThinker: Learning to Verify Makes Reasoning Model Efficient

Authors: Zigeng Chen, Xinyin Ma, Gongfan Fang, Ruonan Yu, Xinchao Wang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments validate that VeriThinker substantially reduces reasoning chain lengths while maintaining or even slightly improving accuracy. When applied to DeepSeek-R1-Distill-Qwen-7B, our approach reduces reasoning tokens on MATH500 from 3790 to 2125 while improving accuracy by 0.8% (94.0% to 94.8%), and on AIME25, tokens decrease from 14321 to 10287 with a 2.1% accuracy gain (38.7% to 40.8%).
Researcher Affiliation	Academia	Zigeng Chen, Xinyin Ma, Gongfan Fang, Ruonan Yu, Xinchao Wang National University of Singapore EMAIL, EMAIL
Pseudocode	No	The paper describes the Supervised Verification Fine-Tuning (SVFT) strategy and provides mathematical formulations (Eq 1, 2, 3) for the objective and loss function, but does not present the method in a structured pseudocode or algorithm block.
Open Source Code	Yes	Code is available at https://github.com/czg1225/VeriThinker
Open Datasets	Yes	For evaluation, we employ multiple mathematical benchmark datasets, including MATH500 [16], GSM8K [6], and two highly challenging competition datasets, AIME2024 and AIME2025. Additionally, in Appendix B, the paper states: "Problem Collection. ... we aggregate problems from four mathematical datasets known for their breadth of content and varying difficulties: PRM12K [32], GSM8K [6], LIMO [70], and Numina-Math [29]."
Dataset Splits	Yes	Datasets. During the fine-tuning phase, we utilize our self-constructed CoT-verification dataset comprising approximately 340k question-CoT pairs, each labeled with correctness indicators. We provide more details about training set construction in the Appendix. For evaluation, we employ multiple mathematical benchmark datasets, including MATH500 [16], GSM8K [6], and two highly challenging competition datasets, AIME2024 and AIME2025.
Hardware Specification	Yes	All training procedures are conducted using Hugging Face's SFTTrainer integrated with DeepSpeed ZeRO-2 optimization [47], distributed across four RTX 6000 Ada GPUs. For evaluation, inference results and throughput metrics are also obtained using RTX 6000 Ada GPUs with the vLLM inference framework [23].
Software Dependencies	No	The paper mentions using 'Hugging Face's SFTTrainer integrated with DeepSpeed ZeRO-2 optimization [47]' and 'vLLM inference framework [23]' along with 'Low-Rank Adaptation (LoRA) [18]', but specific version numbers for these software components are not provided in the text.
Experiment Setup	Yes	Our LoRA configurations are presented in Table 5. We utilized different LoRA ranks and alpha values for the three distinct models to achieve the optimal balance between underfitting and catastrophic forgetting. All other training hyperparameters remain consistent across models: learning rate = 3e-5, LoRA dropout = 0.05, weight decay = 0.01, and batch size = 64. All models were trained for 2 epochs on our self-constructed CoT-Verification dataset.
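
The token and accuracy figures quoted in the Research Type row can be turned into relative reductions with a few lines of arithmetic. The sketch below is illustrative only: the function and variable names are hypothetical, the input numbers are copied from the row above, and the accuracy differences are treated as percentage points (an interpretation of the quoted "0.8%" and "2.1%", not something the excerpt states explicitly).

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the fractional decrease from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# (tokens_before, tokens_after, accuracy_before, accuracy_after),
# copied from the Research Type row for DeepSeek-R1-Distill-Qwen-7B.
reported = {
    "MATH500": (3790, 2125, 94.0, 94.8),
    "AIME25": (14321, 10287, 38.7, 40.8),
}


def summarize(results):
    """Compute token reduction (%) and accuracy gain (points) per benchmark."""
    summary = {}
    for name, (tok_before, tok_after, acc_before, acc_after) in results.items():
        summary[name] = {
            "token_reduction_pct": round(
                100 * relative_reduction(tok_before, tok_after), 1
            ),
            "accuracy_gain_points": round(acc_after - acc_before, 1),
        }
    return summary


if __name__ == "__main__":
    for name, row in summarize(reported).items():
        print(name, row)
```

Under these inputs the reasoning chain shrinks by roughly 44% on MATH500 and roughly 28% on AIME25.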
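
The shared hyperparameters quoted in the Experiment Setup row can be collected into a single configuration object. This is a minimal sketch, not the authors' training script: the class name, field names, and helper method are hypothetical, the LoRA rank and alpha are left without defaults because the excerpt defers them to the paper's Table 5, and reading "batch size = 64" as a global batch size spread over the four GPUs is an assumption.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SVFTTrainingConfig:
    """Hypothetical container for the hyperparameters quoted in the table."""

    # Model-specific values: the excerpt points to Table 5 of the paper and
    # does not quote them, so the caller must supply them.
    lora_rank: int
    lora_alpha: int

    # Values stated in the Experiment Setup row as shared across models.
    learning_rate: float = 3e-5
    lora_dropout: float = 0.05
    weight_decay: float = 0.01
    global_batch_size: int = 64  # assumption: 64 is the global batch size
    num_epochs: int = 2

    # From the Hardware Specification row: four RTX 6000 Ada GPUs.
    num_gpus: int = 4

    def gradient_accumulation_steps(self, per_device_batch_size: int) -> int:
        """Accumulation steps needed to reach the global batch size."""
        per_step = per_device_batch_size * self.num_gpus
        if per_device_batch_size <= 0 or self.global_batch_size % per_step != 0:
            raise ValueError(
                "per-device batch size must evenly divide the global batch size"
            )
        return self.global_batch_size // per_step


if __name__ == "__main__":
    # The rank and alpha below are placeholders, NOT the paper's values.
    cfg = SVFTTrainingConfig(lora_rank=8, lora_alpha=16)
    print(asdict(cfg))
    print("accumulation steps:", cfg.gradient_accumulation_steps(4))
```

Keeping the rank and alpha as required arguments makes it harder to run a model with values that were never checked against the paper.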