Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
DreamPRM: Domain-reweighted Process Reward Model for Multimodal Reasoning
Authors: Qi Cao, Ruiyi Wang, Ruiyi Zhang, Sai Ashish Somayajula, Pengtao Xie
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on multiple multimodal reasoning benchmarks covering both mathematical and general reasoning show that test-time scaling with DreamPRM consistently improves the performance of state-of-the-art MLLMs. |
| Researcher Affiliation | Academia | Qi Cao University of California, San Diego EMAIL Ruiyi Wang University of California, San Diego EMAIL Ruiyi Zhang University of California, San Diego EMAIL Sai Ashish Somayajula University of California, San Diego EMAIL Pengtao Xie University of California, San Diego Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) EMAIL |
| Pseudocode | No | To solve this optimization problem, we propose an efficient gradient-based algorithm, which is detailed in Appendix A. Appendix A describes the algorithm using equations and text, but it is not formatted as a distinct pseudocode block with typical structured programming constructs. |
| Open Source Code | Yes | Project Page: https://github.com/coder-qicao/DreamPRM |
| Open Datasets | Yes | We use 15 multimodal datasets for lower-level optimization (Dtr), covering four domains: science, chart, geometry, and commonsense, as listed in Appendix Table 2. For upper-level optimization (Dmeta), we adopt the MMMU [79] dataset. |
| Dataset Splits | Yes | After processing and sampling, the training datasets in lower-level Dtr have around 15k examples (1k per each of the 15 domains), while the meta dataset in the upper-level Dmeta has around 1k validation examples from the MMMU [79] dataset. |
| Hardware Specification | Yes | Our method is implemented with Betty [7], and the fine-tuning process takes approximately 10 hours on one NVIDIA A100 GPU. |
| Software Dependencies | No | The paper mentions "AdamW [32] optimizer" and "Betty [7]" but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | In the lower-level optimization, we perform 5 inner gradient steps per outer update (unroll steps = 5) using the AdamW [32] optimizer with learning rate set to 5 × 10⁻⁷. In the upper-level optimization, we use the AdamW optimizer (lr = 0.01, weight decay = 10⁻³) and a StepLR scheduler (step size = 5000, γ = 0.5). In total, DreamPRM is fine-tuned for 10000 iterations. |
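
The upper-level learning-rate schedule quoted in the Experiment Setup row can be worked out with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the paper's code: the names `UpperLevelConfig` and `step_lr` are hypothetical, the formula is the standard step-decay rule (base rate multiplied by γ once every `step_size` steps), and it assumes the scheduler is stepped once per fine-tuning iteration, which the quoted text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpperLevelConfig:
    """Hypothetical container for the upper-level values quoted in the table."""
    base_lr: float = 0.01         # AdamW learning rate
    weight_decay: float = 1e-3    # AdamW weight decay
    step_size: int = 5000         # step-decay period
    gamma: float = 0.5            # step-decay factor
    total_iterations: int = 10000


def step_lr(cfg: UpperLevelConfig, iteration: int) -> float:
    """Step decay: base_lr * gamma ** (iteration // step_size).

    Assumes one scheduler step per iteration (an assumption, not a
    detail confirmed by the quoted text).
    """
    return cfg.base_lr * cfg.gamma ** (iteration // cfg.step_size)


cfg = UpperLevelConfig()
for it in (0, 4999, 5000, cfg.total_iterations - 1):
    print(f"iteration {it}: lr = {step_lr(cfg, it)}")
```

Under these assumptions the upper-level rate stays at 0.01 for the first 5000 iterations and drops to 0.005 for the remaining 5000, so the schedule decays exactly once within the reported run.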