Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

SCAN: Self-Denoising Monte Carlo Annotation for Robust Process Reward Learning

Authors: Yuyang Ding, Xinyu Shi, Juntao Li, Xiaobo Liang, Zhaopeng Tu, Min Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate the trained PRM from two perspectives: test-time scaling (best-of-8 evaluation) and step-wise error detection (ProcessBench [44]). The results demonstrate that our model consistently outperforms strong existing baselines, achieving performance comparable to that of the human-annotated dataset PRM800K.
Researcher Affiliation	Collaboration	Yuyang Ding¹, Xinyu Shi¹, Juntao Li¹, Xiaobo Liang¹, Zhaopeng Tu², Min Zhang¹; ¹Soochow University, ²Tencent
Pseudocode	No	The paper describes its methodology in text and provides a workflow diagram in Figure 2, but it does not contain any formally labeled 'Pseudocode' or 'Algorithm' blocks with structured steps.
Open Source Code	Yes	We provide our code implementation of our preliminary study and main experiments in https://anonymous.4open.science/r/SCAN-PRM.
Open Datasets	Yes	We use 7,500 questions with golden answers from the MATH training set. We evaluate the trained PRM from two perspectives: test-time scaling (best-of-8 evaluation) and step-wise error detection (ProcessBench [44]). The evaluation datasets cover various difficulty levels, including GSM8K [2] (elementary), MATH [11] (competition), College Math [31] (college), and OlympiadBench [10] (Olympiad).
Dataset Splits	No	The paper mentions generating SCAN-Base (101K) and SCAN-Pro (197K) as 'training datasets' and evaluates on external benchmarks such as ProcessBench, GSM8K, MATH, College Math, and OlympiadBench. However, it does not explicitly provide train/validation/test splits for its own constructed SCAN datasets or for the external datasets used for training.
Hardware Specification	Yes	All experiments were conducted on a single machine equipped with 8 GPUs, requiring only 47 hours in real time. The inference speed is tested on 4 A100-40G GPUs.
Software Dependencies	No	For inference deployment, we explored two approaches: (1) launching multiple vLLM [15] servers and making API calls, using distributed routing strategies powered by FastChat [45], and (2) batching multiple queries and utilizing Ray [22] to schedule resources for distributed inference. The paper mentions these software tools but does not provide specific version numbers for any of them.
Experiment Setup	Yes	We train our process reward models based on Qwen2.5-Math-7B-Instruct with a constant learning rate of 7e-6 and batch size of 128. The model is trained for one epoch, as we observed that multiple epochs lead to rapid overfitting, particularly on synthetic data.