Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

DreamPRM: Domain-reweighted Process Reward Model for Multimodal Reasoning

Authors: Qi Cao, Ruiyi Wang, Ruiyi Zhang, Sai Ashish Somayajula, Pengtao Xie

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on multiple multimodal reasoning benchmarks covering both mathematical and general reasoning show that test-time scaling with DreamPRM consistently improves the performance of state-of-the-art MLLMs.
Researcher Affiliation	Academia	Qi Cao University of California, San Diego EMAIL Ruiyi Wang University of California, San Diego EMAIL Ruiyi Zhang University of California, San Diego EMAIL Sai Ashish Somayajula University of California, San Diego EMAIL Pengtao Xie University of California, San Diego Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) EMAIL
Pseudocode	No	To solve this optimization problem, we propose an efficient gradient-based algorithm, which is detailed in Appendix A. Appendix A describes the algorithm using equations and text, but it is not formatted as a distinct pseudocode block with typical structured programming constructs.
Open Source Code	Yes	Project Page: https://github.com/coder-qicao/DreamPRM
Open Datasets	Yes	We use 15 multimodal datasets for lower-level optimization (Dtr), covering four domains: science, chart, geometry, and commonsense, as listed in Appendix Table 2. For upper-level optimization (Dmeta), we adopt the MMMU [79] dataset.
Dataset Splits	Yes	After processing and sampling, the training datasets in lower-level Dtr have around 15k examples (1k per each of the 15 domains), while the meta dataset in the upper-level Dmeta has around 1k validation examples from the MMMU [79] dataset.
Hardware Specification	Yes	Our method is implemented with Betty [7], and the fine-tuning process takes approximately 10 hours on one NVIDIA A100 GPUs.
Software Dependencies	No	The paper mentions "AdamW [32] optimizer" and "Betty [7]" but does not provide specific version numbers for these software components.
Experiment Setup	Yes	In the lower-level optimization, we perform 5 inner gradient steps per outer update (unroll steps = 5) using the AdamW [32] optimizer with learning rate set to 5 × 10^-7. In the upper-level optimization, we use the AdamW optimizer (lr = 0.01, weight decay = 10^-3) and a StepLR scheduler (step size = 5000, γ = 0.5). In total, DreamPRM is fine-tuned for 10000 iterations.
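
To make the upper-level schedule quoted in the Experiment Setup row concrete, here is a minimal pure-Python sketch of how a step-decay rule with those settings would evolve. It is not taken from the DreamPRM code: the helper name `step_lr` is hypothetical, and it assumes the scheduler is advanced once per fine-tuning iteration, which the quoted text does not state.

```python
# Hyperparameters as quoted in the Experiment Setup row (upper-level optimization).
UPPER_LR = 0.01       # initial learning rate
STEP_SIZE = 5000      # scheduler step size
GAMMA = 0.5           # multiplicative decay factor
TOTAL_ITERS = 10_000  # total fine-tuning iterations


def step_lr(base_lr: float, step_size: int, gamma: float, iteration: int) -> float:
    """Step decay: multiply the rate by `gamma` once every `step_size` iterations.

    Hypothetical helper; assumes one scheduler step per iteration.
    """
    return base_lr * gamma ** (iteration // step_size)


# Sample the schedule at the start, around the decay boundary, and at the end.
checkpoints = (0, STEP_SIZE - 1, STEP_SIZE, TOTAL_ITERS - 1)
schedule = {it: step_lr(UPPER_LR, STEP_SIZE, GAMMA, it) for it in checkpoints}

if __name__ == "__main__":
    for it, lr in schedule.items():
        print(f"iteration {it:>5}: lr = {lr:g}")
```

Under that assumption, the upper-level learning rate would stay at 0.01 for the first 5000 iterations and drop to 0.005 for the remaining 5000, so exactly one decay occurs before fine-tuning ends.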