Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Is PRM Necessary? Problem-Solving RL Implicitly Induces PRM Capability in LLMs
Authors: Zhangyin Feng, Qianglong Chen, Ning Lu, Yongqian Li, Siqi Cheng, Shuangmu Peng, Duyu Tang, Shengcai Liu, Zhirui Zhang
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this study, we conduct a systematic investigation of the relationship between RL training and PRM capabilities. Our findings demonstrate that problem-solving proficiency and process supervision capabilities represent complementary dimensions of reasoning that co-evolve synergistically during pure RL training. Through a series of experiments, we examine how reasoning models trained solely with rule-based rewards develop strong process-level judgement capabilities, without access to fine-grained supervision. Through a series of controlled experiments on math reasoning tasks, we demonstrate that RL-trained models like DeepSeek-R1 and QwQ-32B exhibit strong process judgement abilities, exceeding those of models explicitly trained with PRMs. The evaluation results of different LLMs on PROCESSBENCH are presented in Table 1, which summarizes the PRM performance of various models across the GSM8K, MATH, OLYMPIADBENCH, and OMNIMATH datasets. |
| Researcher Affiliation | Collaboration | Zhangyin Feng (Huawei Technologies Ltd.); Ning Lu (HKUST); Shengcai Liu (Guangdong Provincial Key Laboratory of Brain-Inspired Intelligent Computation, Department of CSE, SUSTech) |
| Pseudocode | No | The paper describes methods and processes verbally and through experimental results, but it does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not explicitly state that the authors are releasing their own source code for the methodology described in this paper, nor does it provide a direct link to a repository for their specific implementation. While it references external open-source projects like 'Open-Reasoner-Zero' (https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero) and 'RLHFlow/RLHF-Reward-Modeling' (https://github.com/RLHFlow/RLHF-Reward-Modeling), these are third-party tools or frameworks used by the authors, not their own implementation code. The NeurIPS checklist also indicates 'NA' for open access to data and code, reinforcing this. |
| Open Datasets | Yes | We evaluate the PRM capabilities of various models using PROCESSBENCH, a publicly available benchmark designed to assess the reasoning abilities of LLMs across diverse domains of mathematical problem-solving. PROCESSBENCH comprises four distinct datasets: GSM8K [3], MATH, OLYMPIADBENCH [9], and OMNIMATH [6]. AIME24 problem indices are sourced from the math-ai/aime24 dataset on Hugging Face. |
| Dataset Splits | No | The paper describes how problems are categorized for analysis (e.g., 'True' or 'False' solutions, 'Correct' or 'Error' judgments) and details sampling for evaluation (e.g., 'sampled 64 solutions'). However, it does not provide specific information regarding the division of datasets into training, validation, and test sets, either by percentages, sample counts, or references to predefined standard splits for model training or evaluation splits for the benchmark. While it mentions using PROCESSBENCH for evaluation and DAPO-Math-17k for RL training, explicit details on data partitioning are absent. |
| Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running the experiments. While the NeurIPS paper checklist claims justification in Sections 3.1 and 4.1, these sections only describe the hyperparameters and models used, not hardware specifications. |
| Software Dependencies | No | The paper mentions various models and algorithms such as 'Qwen2.5-7B-Base', 'DAPO', 'DeepSeek-R1', 'PPO', and 'DPO'. However, it does not provide specific version numbers for software dependencies or libraries (e.g., Python, PyTorch, TensorFlow, or specific library versions) that would be needed to replicate the experimental environment. |
| Experiment Setup | Yes | In Section 3.1, the paper states: 'The hyperparameters are set as follows: learning rate of 1e-6, batch size of 256, prompt length of 2048, output length of 10240, group size of 16, clipping ratio (high) of 0.28, overlong buffer length of 4096, and an overlong penalty factor of 1.0.' Section 4.1 further details evaluation settings: 'we evaluate two variants on the PROCESSBENCH benchmark... (1) we prompt these models as generative PRMs, and (2) a Self-REF-enhanced PRM incorporating the model's own solutions as supervisory signals.' and 'Evaluations are conducted on AIME24, AIME25, and CNMO24, with performance compared across direct sampling (Pass@k), majority voting, BoN with external PRM, and BoN with Self-PRM.' It also specifies sampling sizes: 'k = 8, 16, 32, 64'. |
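
The selection strategies named in the Experiment Setup row (Pass@k, majority voting, and BoN with a PRM score) can be illustrated with a minimal Python sketch. This is not the authors' implementation, which the paper does not release. The `Sample` type, the function names, the dictionary keys, and the toy scores are all hypothetical; only the hyperparameter values are quoted from Section 3.1. Pass@k is shown in its simplest empirical form (any of the first k samples is correct), which may differ from the estimator used in the paper.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Sequence

# Values quoted from Section 3.1 of the paper; the key names are hypothetical.
RL_HYPERPARAMS = {
    "learning_rate": 1e-6,
    "batch_size": 256,
    "prompt_length": 2048,
    "output_length": 10240,
    "group_size": 16,
    "clip_ratio_high": 0.28,
    "overlong_buffer_length": 4096,
    "overlong_penalty_factor": 1.0,
}


@dataclass(frozen=True)
class Sample:
    """One sampled solution and the final answer extracted from it (hypothetical type)."""
    solution: str
    answer: str


def pass_at_k(samples: Sequence[Sample], reference: str, k: int) -> bool:
    """Simple empirical Pass@k: True if any of the first k samples is correct."""
    return any(s.answer == reference for s in samples[:k])


def majority_vote(samples: Sequence[Sample]) -> str:
    """Return the most frequent final answer among the samples."""
    return Counter(s.answer for s in samples).most_common(1)[0][0]


def best_of_n(samples: Sequence[Sample], score: Callable[[Sample], float]) -> str:
    """Return the answer of the sample with the highest score.

    `score` stands in for a PRM: an external reward model, or the
    policy model judging its own solutions (Self-PRM).
    """
    return max(samples, key=score).answer


# Toy usage with made-up samples and made-up scores.
samples = [
    Sample("s1", "42"),
    Sample("s2", "41"),
    Sample("s3", "42"),
    Sample("s4", "7"),
]
toy_scores = {"s1": 0.2, "s2": 0.9, "s3": 0.4, "s4": 0.1}

voted = majority_vote(samples)
picked = best_of_n(samples, lambda s: toy_scores[s.solution])
```

In this toy case majority voting and BoN disagree: voting follows answer frequency, while BoN trusts a single highest-scored solution, so its quality depends entirely on how reliable the scorer is.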