Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
VersaPRM: Multi-Domain Process Reward Model via Synthetic Reasoning Data
Authors: Thomas Zeng, Shuibai Zhang, Shutong Wu, Christian Classen, Daewon Chae, Ethan Ewer, Minjae Lee, Heeju Kim, Wonjun Kang, Jackson Kunde, Ying Fan, Jungtaek Kim, Hyung Il Koo, Kannan Ramchandran, Dimitris Papailiopoulos, Kangwook Lee
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | However, they are predominantly trained on mathematical data and their generalizability to non-mathematical domains has not been rigorously studied. In response, this work first shows that current PRMs have poor performance in other domains. To address this limitation, we introduce VersaPRM, a multi-domain PRM trained on synthetic reasoning data generated using our novel data generation and annotation method. VersaPRM achieves consistent performance gains across diverse domains. For instance, in the MMLU-Pro category of Law, VersaPRM, via weighted majority voting, achieves a 7.9% performance gain over the majority voting baseline, surpassing Qwen2.5-Math-PRM's gain of 1.3%. We further contribute to the community by open-sourcing all data, code and models for VersaPRM. |
| Researcher Affiliation | Collaboration | 1University of Wisconsin Madison 2Korea University 3Furiosa AI 4University of California, Berkeley. |
| Pseudocode | Yes | See Algorithm 1 for details. Monte Carlo Tree Search (MCTS). MCTS is a search algorithm used during test-time inference (Hao et al., 2023; Wan et al., 2024) that iteratively builds a search tree to find the CoT with the highest aggregated PRM score. A detailed description is presented in Appendix C and the pseudo-code is provided in Algorithm 2. |
| Open Source Code | Yes | We open-source the implementation of VersaPRM with training details, our multi-domain reasoning data, and its model checkpoint; available at https://github.com/UW-Madison-Lee-Lab/VersaPRM. |
| Open Datasets | Yes | Notably, by sampling questions from the MMLU-Pro dataset (Wang et al., 2024c), we generate CoTs to produce step-by-step reasoning using an LLM-based generator, i.e., Llama-3.1-8B-Instruct (Dubey et al., 2024), and then auto-label them using an LLM-based labeler, i.e., Llama-3.1-70B-Instruct (Dubey et al., 2024). VersaPRM, which is trained on the resulting synthetic multi-domain reasoning data, shows strong performance across diverse domains. |
| Dataset Splits | Yes | For our multi-domain evaluation dataset, we curate questions sampled from the MMLU-Pro dataset (Wang et al., 2024c)... To craft our evaluation dataset, we randomly sample 150 questions from each domain... For training, we source questions from the MMLU-Pro dataset (Wang et al., 2024c), by randomly sampling up to 500 questions per domain, ensuring that it is disjoint from the subset used for evaluation. |
| Hardware Specification | No | The CoT generation and labeling were done using AWS Bedrock batch inference at a total cost under $100 (USD). |
| Software Dependencies | No | The paper mentions using Llama-3.1-8B-Instruct and Llama-3.1-70B-Instruct as LLM models, but does not provide specific version numbers for any software libraries or frameworks used for implementation (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | During generation, we use a temperature of 0.8 and set the maximum generation length to 2,048 tokens. During auto-labeling, we use a temperature of 0, and the maximum generation length remains at 2,048 tokens. For training our math PRMs on the PRM800K dataset (Qwen PRM800K and Llama PRM800K), we employ a batch size of 128 and perform full fine-tuning. For experiments on mixed-domain datasets, we reduce the batch size to 32 due to smaller dataset size. All training is conducted over a single epoch. For full fine-tuning, we use a learning rate of 1.25 × 10⁻⁶, while for LoRA-based fine-tuning, we use a learning rate of 1.0 × 10⁻⁴. |
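The weighted majority voting quoted in the abstract excerpt above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each sampled CoT has already been reduced to a final answer plus a single aggregated PRM score, and simply sums scores per answer.

```python
from collections import defaultdict

def weighted_majority_vote(candidates):
    """Pick the answer whose candidate CoTs carry the most total PRM score.

    `candidates` is a list of (answer, prm_score) pairs; how the per-step
    PRM scores are aggregated into one score per CoT is assumed, not
    taken from the paper.
    """
    totals = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score
    # Plain majority voting is the special case where every score is 1.0.
    return max(totals, key=totals.get)

# Two lower-scored "A" CoTs together outweigh one high-scored "B" CoT.
votes = [("A", 0.6), ("B", 0.9), ("A", 0.5), ("C", 0.3)]
print(weighted_majority_vote(votes))  # → A
```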
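The Dataset Splits row describes per-domain sampling with disjoint evaluation and training subsets (150 evaluation questions per domain, up to 500 training questions per domain). A hedged sketch of that splitting logic, with illustrative function and field names not taken from the paper's code:

```python
import random

def split_by_domain(questions, n_eval=150, n_train_max=500, seed=0):
    """Per domain: take n_eval questions for evaluation, then up to
    n_train_max of the remaining questions for training, so the two
    subsets are disjoint by construction.

    `questions` is assumed to be a list of dicts with a "domain" key;
    this mirrors the quoted split sizes, not the authors' actual script.
    """
    rng = random.Random(seed)
    by_domain = {}
    for q in questions:
        by_domain.setdefault(q["domain"], []).append(q)

    eval_set, train_set = [], []
    for domain, qs in sorted(by_domain.items()):
        rng.shuffle(qs)
        eval_set.extend(qs[:n_eval])                      # evaluation slice
        train_set.extend(qs[n_eval:n_eval + n_train_max]) # disjoint training slice
    return train_set, eval_set
```

Slicing the shuffled list guarantees disjointness without an explicit membership check.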