Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
VersaPRM: Multi-Domain Process Reward Model via Synthetic Reasoning Data
Authors: Thomas Zeng, Shuibai Zhang, Shutong Wu, Christian Classen, Daewon Chae, Ethan Ewer, Minjae Lee, Heeju Kim, Wonjun Kang, Jackson Kunde, Ying Fan, Jungtaek Kim, Hyung Il Koo, Kannan Ramchandran, Dimitris Papailiopoulos, Kangwook Lee
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | However, they are predominantly trained on mathematical data and their generalizability to non-mathematical domains has not been rigorously studied. In response, this work first shows that current PRMs have poor performance in other domains. To address this limitation, we introduce VersaPRM, a multi-domain PRM trained on synthetic reasoning data generated using our novel data generation and annotation method. VersaPRM achieves consistent performance gains across diverse domains. For instance, in the MMLU-Pro category of Law, VersaPRM, via weighted majority voting, achieves a 7.9% performance gain over the majority voting baseline, surpassing Qwen2.5-Math-PRM's gain of 1.3%. We further contribute to the community by open-sourcing all data, code and models for VersaPRM. |
| Researcher Affiliation | Collaboration | 1University of Wisconsin Madison 2Korea University 3Furiosa AI 4University of California, Berkeley. |
| Pseudocode | Yes | See Algorithm 1 for details. Monte Carlo Tree Search (MCTS). MCTS is a search algorithm used during test-time inference (Hao et al., 2023; Wan et al., 2024) that iteratively builds a search tree to find the CoT with the highest aggregated PRM score. A detailed description is presented in Appendix C and the pseudo-code is provided in Algorithm 2. |
| Open Source Code | Yes | We open-source the implementation of VersaPRM with training details, our multi-domain reasoning data, and its model checkpoint; available at https://github.com/UW-Madison-Lee-Lab/VersaPRM. |
| Open Datasets | Yes | Notably, by sampling questions from the MMLU-Pro dataset (Wang et al., 2024c), we generate CoTs to produce step-by-step reasoning using an LLM-based generator, i.e., Llama-3.1-8B-Instruct (Dubey et al., 2024), and then auto-label them using an LLM-based labeler, i.e., Llama-3.1-70B-Instruct (Dubey et al., 2024). VersaPRM, which is trained on the resulting synthetic multi-domain reasoning data, shows strong performance across diverse domains. |
| Dataset Splits | Yes | For our multi-domain evaluation dataset, we curate questions sampled from the MMLU-Pro dataset (Wang et al., 2024c)... To craft our evaluation dataset, we randomly sample 150 questions from each domain... For training, we source questions from the MMLU-Pro dataset (Wang et al., 2024c), by randomly sampling up to 500 questions per domain, ensuring that it is disjoint from the subset used for evaluation. |
| Hardware Specification | No | The CoT generation and labeling were done using AWS Bedrock batch inference at a total cost under $100 (USD). |
| Software Dependencies | No | The paper mentions using Llama-3.1-8B-Instruct and Llama-3.1-70B-Instruct as LLM models, but does not provide specific version numbers for any software libraries or frameworks used for implementation (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | During generation, we use a temperature of 0.8 and set the maximum generation length to 2,048 tokens. During auto-labeling, we use a temperature of 0, and the maximum generation length remains at 2,048 tokens. For training our math PRMs on the PRM800K dataset (Qwen PRM800K and Llama PRM800K), we employ a batch size of 128 and perform full fine-tuning. For experiments on mixed-domain datasets, we reduce the batch size to 32 due to smaller dataset size. All training is conducted over a single epoch. For full fine-tuning, we use a learning rate of 1.25 × 10⁻⁶, while for LoRA-based fine-tuning, we use a learning rate of 1.0 × 10⁻⁴. |
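The weighted majority voting quoted in the abstract excerpt above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each sampled CoT has already been reduced to a final answer plus a single aggregated PRM score, and simply sums scores per answer.

```python
from collections import defaultdict

def weighted_majority_vote(candidates):
    """Pick the answer whose candidate CoTs carry the most total PRM score.

    `candidates` is a list of (answer, prm_score) pairs; how the per-step
    PRM scores are aggregated into one score per CoT is assumed, not
    taken from the paper.
    """
    totals = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score
    # Plain majority voting is the special case where every score is 1.0.
    return max(totals, key=totals.get)

# Two lower-scored "A" CoTs together outweigh one high-scored "B" CoT.
votes = [("A", 0.6), ("B", 0.9), ("A", 0.5), ("C", 0.3)]
print(weighted_majority_vote(votes))  # → A
```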
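The Dataset Splits row describes per-domain sampling with disjoint evaluation and training subsets (150 evaluation questions per domain, up to 500 training questions per domain). A hedged sketch of that splitting logic, with illustrative function and field names not taken from the paper's code:

```python
import random

def split_by_domain(questions, n_eval=150, n_train_max=500, seed=0):
    """Per domain: take n_eval questions for evaluation, then up to
    n_train_max of the remaining questions for training, so the two
    subsets are disjoint by construction.

    `questions` is assumed to be a list of dicts with a "domain" key;
    this mirrors the quoted split sizes, not the authors' actual script.
    """
    rng = random.Random(seed)
    by_domain = {}
    for q in questions:
        by_domain.setdefault(q["domain"], []).append(q)

    eval_set, train_set = [], []
    for domain, qs in sorted(by_domain.items()):
        rng.shuffle(qs)
        eval_set.extend(qs[:n_eval])                      # evaluation slice
        train_set.extend(qs[n_eval:n_eval + n_train_max]) # disjoint training slice
    return train_set, eval_set
```

Slicing the shuffled list guarantees disjointness without an explicit membership check.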