Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
SCAN: Self-Denoising Monte Carlo Annotation for Robust Process Reward Learning
Authors: Yuyang Ding, Xinyu Shi, Juntao Li, Xiaobo Liang, Zhaopeng Tu, Min Zhang
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the trained PRM from two perspectives: test-time scaling (best-of-8 evaluation) and step-wise error detection (ProcessBench [44]). The results demonstrate that our model consistently outperforms strong existing baselines, achieving performance comparable to that of the human-annotated dataset PRM800K. |
| Researcher Affiliation | Collaboration | Yuyang Ding¹, Xinyu Shi¹, Juntao Li¹, Xiaobo Liang¹, Zhaopeng Tu², Min Zhang¹; ¹Soochow University, ²Tencent |
| Pseudocode | No | The paper describes its methodology in text and provides a workflow diagram in Figure 2, but it does not contain any formally labeled 'Pseudocode' or 'Algorithm' blocks with structured steps. |
| Open Source Code | Yes | We provide our code implementation of our preliminary study and main experiments in https://anonymous.4open.science/r/SCAN-PRM. |
| Open Datasets | Yes | We use 7,500 questions with golden answers from the MATH training set. We evaluate the trained PRM from two perspectives: test-time scaling (best-of-8 evaluation) and step-wise error detection (ProcessBench [44]). The evaluation datasets cover various difficulty levels, including GSM8K [2] (elementary), MATH [11] (competition), College Math [31] (college), and OlympiadBench [10] (Olympiad). |
| Dataset Splits | No | The paper mentions generating SCAN-Base (101K) and SCAN-Pro (197K) as 'training datasets' and evaluates on external benchmarks like ProcessBench, GSM8K, MATH, College Math, and OlympiadBench. However, it does not explicitly provide specific train/validation/test splits for its own constructed SCAN datasets, or for the external datasets used for training. |
| Hardware Specification | Yes | All experiments were conducted on a single machine equipped with 8 GPUs, requiring only 47 hours in real time. The inference speed is tested on 4 A100-40G GPUs. |
| Software Dependencies | No | For inference deployment, we explored two approaches: (1) launching multiple vLLM [15] servers and making API calls, using distributed routing strategies powered by FastChat [45], and (2) batching multiple queries and utilizing Ray [22] to schedule resources for distributed inference. The paper mentions these software tools but does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | We train our process reward models based on Qwen2.5-Math-7B-Instruct with a constant learning rate of 7e-6 and batch size of 128. The model is trained for one epoch, as we observed that multiple epochs lead to rapid overfitting, particularly on synthetic data. |
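
The two evaluation settings named in the table, best-of-N selection and step-wise error detection, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `toy_scorer` function stands in for a trained PRM, and both the `min` aggregation over step scores and the 0.5 error threshold are assumptions chosen for illustration.

```python
from typing import Callable, List, Sequence

Solution = Sequence[str]  # one candidate solution, split into reasoning steps


def best_of_n(
    candidates: Sequence[Solution],
    score_steps: Callable[[Solution], List[float]],
    aggregate: Callable[[List[float]], float] = min,  # assumed aggregation rule
) -> int:
    """Return the index of the candidate with the highest aggregated step score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    totals = [aggregate(score_steps(c)) for c in candidates]
    return max(range(len(candidates)), key=totals.__getitem__)


def first_error_step(scores: Sequence[float], threshold: float = 0.5) -> int:
    """Return the index of the earliest step scored below threshold, or -1 if none."""
    for i, score in enumerate(scores):
        if score < threshold:  # threshold is an illustrative assumption
            return i
    return -1


def toy_scorer(solution: Solution) -> List[float]:
    # Hypothetical stand-in for a trained PRM: penalize steps marked "wrong".
    return [0.1 if "wrong" in step else 0.9 for step in solution]


candidates = [
    ["2 + 2 = 4", "4 * 3 = 12"],
    ["2 + 2 = 5 (wrong)", "5 * 3 = 15"],
]
best_index = best_of_n(candidates, toy_scorer)
error_index = first_error_step(toy_scorer(candidates[1]))
```

In the best-of-8 setting the `candidates` list would hold eight sampled solutions per question; swapping `aggregate` for another function (for example, one that takes the last step's score) changes the selection rule without touching the rest of the sketch.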