Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Reasoning Is Not a Race: When Stopping Early Beats Going Deeper

Authors: Mohan Zhang, Jiaxuan Gao, Shusheng Xu, Yi Wu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Across multiple math benchmarks and model scales, ZGES outperforms both standard PRM-guided beam search and the PRM-free methods. Ablation studies further highlight the advantages and robustness of ZGES's adaptive stopping mechanism. We evaluate ZGES on Long CoT models of varying scales, including DeepSeek-R1-Distill-Qwen-1.5B and 7B [9]. It consistently delivers strong performance across all Beam Search configurations and achieves the best results on all evaluated benchmarks. Main Result: The comparison results are presented in Figure 4. ZGES demonstrates highly competitive performance compared to the baselines, with the following notable advantages: Best Performance Across Settings. Robustness to Beam Size Scaling. Improved Performance with Reduced Computational Cost.
Researcher Affiliation	Collaboration	Mohan Zhang1 Jiaxuan Gao1 Shusheng Xu2 Yi Wu1 1IIIS, Tsinghua University 2Ant Group
Pseudocode	No	The paper describes the ZGES method in Section 4.2 with numbered steps (1-4). However, these steps are presented as regular paragraph text within a numbered list, not in a formalized pseudocode block or algorithm-like structure with specific code-like formatting or explicit labeling such as "Pseudocode" or "Algorithm".
Open Source Code	No	Does the paper provide open access to the data and code...? Answer: [No] Justification: We do not release the code, but our method is straightforward to reproduce.
Open Datasets	Yes	Evaluations are performed on three challenging math benchmarks: AMC2023, AIME2024, and AIME2025. For PRM training, we follow the settings proposed by [3]... To train hard PRM and soft PRM, we gather 9K challenging questions from the training data of the DeepScaleR project [18] by keeping problems that have non-trivial accuracy for DeepSeek-R1-Distill-Qwen-7B.
Dataset Splits	No	The paper mentions using "AMC2023, AIME2024, and AIME2025" for evaluation and "9K challenging questions from the training data of the DeepScaleR project [18]" for PRM training. While the datasets are identified, the paper does not specify explicit training/validation/test splits (e.g., percentages, sample counts, or references to predefined splits) for any of the datasets used in the experiments.
Hardware Specification	No	The paper does not provide specific hardware details such as GPU models (e.g., NVIDIA A100), CPU models, or memory amounts used for running the experiments. While the NeurIPS checklist indicates that hardware details are in the appendix, a review of the appendix (B.1, B.2, B.3, C) does not reveal such specific information.
Software Dependencies	No	The paper mentions using "DeepSeek-R1-Distill-Qwen-1.5B and 7B [9]" as models, but it does not specify versions for general software libraries or environments such as Python, PyTorch, TensorFlow, or CUDA.
Experiment Setup	Yes	For PRM training, we follow the settings proposed by [3], including both the hard and soft configurations (details are provided in Appendix B.1). For the hyperparameter λ, we set λ = 0.6 for the 1.5B model and λ = 0 for the 7B model. In all Beam Search settings, the expansion factor is fixed to 2. Appendix B.2 states: starting from each candidate, we generate N subsequent trajectories (we set N = 4 in our experiments) and approximate the step quality by the proportion of trajectories that reach the correct answer.
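
The step-quality estimate quoted from Appendix B.2 can be illustrated with a minimal sketch. Since the paper's code is not released, this is not the authors' implementation: the function name estimate_step_quality, the rollout and is_correct callables, and the toy stand-ins below are all hypothetical, chosen only to show the "proportion of N trajectories that reach the correct answer" computation with N = 4.

```python
from itertools import cycle
from typing import Callable


def estimate_step_quality(
    candidate: str,
    rollout: Callable[[str], str],
    is_correct: Callable[[str], bool],
    n_rollouts: int = 4,
) -> float:
    """Return the fraction of n_rollouts continuations of `candidate`
    whose final answer is judged correct.

    `rollout` and `is_correct` are hypothetical hooks: the first would
    sample one continuation from a model and return its final answer,
    the second would compare that answer with the reference.
    """
    if n_rollouts <= 0:
        raise ValueError("n_rollouts must be positive")
    hits = sum(1 for _ in range(n_rollouts) if is_correct(rollout(candidate)))
    return hits / n_rollouts


# Toy stand-ins (assumptions for illustration only): a fake sampler that
# cycles through fixed answers instead of calling a language model.
_fake_answers = cycle(["42", "41", "42", "42"])


def toy_rollout(candidate: str) -> str:
    return next(_fake_answers)


quality = estimate_step_quality(
    "Step 1: ...", toy_rollout, lambda answer: answer == "42"
)
print(quality)  # 3 of the 4 toy rollouts are correct
```

With a real model, rollout would sample independently each time, so the estimate is noisy at N = 4 and takes only the values 0, 0.25, 0.5, 0.75, or 1.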