Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Unlocking Multimodal Mathematical Reasoning via Process Reward Model

Authors: Ruilin Luo, Zhuofan Zheng, Lei Wang, Yifan Wang, Xinzhe Ni, Zicheng Lin, Songtao Jiang, Yiyao Yu, Chufan Shi, Ruihang Chu, Jin Zeng, Yujiu Yang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Results on 6 multimodal reasoning benchmarks show that our PRM improves Best-of-N verification, surpassing self-consistency and outcome-based baselines. When used in PS-GRPO, the resulting model achieves state-of-the-art performance among open-source MLLMs of similar size. Our contributions are as follows: ... Experimental results show that our reward model improves both test-time verification and online training. With PS-GRPO application, URSA-8B-PS-GRPO outperforms Gemma3-12B and GPT-4o by 8.4% and 2.7% on average across 6 benchmarks.
Researcher Affiliation | Collaboration | Ruilin Luo (1,2), Zhuofan Zheng (2), Lei Wang (3), Yifan Wang (1), Xinzhe Ni (1), Zicheng Lin (1), Songtao Jiang (4), Yiyao Yu (1), Chufan Shi (1), Ruihang Chu (1), Jin Zeng (2), Yujiu Yang (1). (1) Tsinghua University; (2) ByteDance; (3) Ping An Technology (Shenzhen) Co., Ltd.; (4) Zhejiang University
Pseudocode | Yes | Algorithm 1: Binary Error Locating; Algorithm 2: PS-GRPO
Open Source Code | Yes | Code, data and checkpoint can be found at https://github.com/URSA-MATH.
Open Datasets | Yes | We release two large-scale open-source datasets, MMathCoT-1M and DualMath-1.1M, to address the scarcity of high-quality multimodal CoT reasoning and process supervision data. Code, data and checkpoint can be found at https://github.com/URSA-MATH.
Dataset Splits | Yes | We select 15K data in MMathCoT-1M for PS-GRPO. We collect 20K data with a types mixture ratio similar to that of instruction fine-tuning and conduct a one-time static filtering before RL. Specifically, we use URSA-8B to perform 8 samplings on this 20K data, filtering out examples where all 8 sampling results are either incorrect or correct. This left approximately 15K+ data for training vanilla GRPO and PS-GRPO.
Hardware Specification | Yes | Unless otherwise specified, experiments are conducted on 32 NVIDIA-H100-HBM3 GPUs by default. During the data pair generation phase, we use 16 NVIDIA-H100-HBM3 GPUs for inference, which takes approximately 28 hours.
Software Dependencies | Yes | Our experiments are based on Python 3.10 and PyTorch 2.4.0+cu124.
Experiment Setup | Yes | The hyperparameter and time cost used in Stage I and Stage II are demonstrated in Table 14. Since the parameters used in Stage III are somewhat different, we list them separately in Table 15.
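
The one-time static filtering quoted under Dataset Splits (sample each example 8 times and drop examples where all 8 results are correct or all are incorrect) could be sketched roughly as follows. This is an illustrative sketch only, not the authors' code: the function name `filter_mixed_outcome` and the callback `sample_is_correct` are hypothetical stand-ins for generating one response with the policy model (URSA-8B in the paper) and checking it against the reference answer.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def filter_mixed_outcome(
    examples: Iterable[T],
    sample_is_correct: Callable[[T], bool],
    num_samples: int = 8,
) -> List[T]:
    """Keep examples whose sampled answers are neither all correct nor all incorrect.

    `sample_is_correct` is a hypothetical callback: each call is assumed to draw
    one fresh response for the example and return whether it is correct.
    """
    kept: List[T] = []
    for example in examples:
        # Draw `num_samples` independent responses and record their correctness.
        outcomes = [sample_is_correct(example) for _ in range(num_samples)]
        num_correct = sum(outcomes)
        # Drop examples that are always solved or never solved.
        if 0 < num_correct < num_samples:
            kept.append(example)
    return kept
```

Under this reading, only examples with mixed outcomes survive; these are the ones for which the rewards within a sampled group differ, which is presumably why the filter is applied before RL.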