Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning

Authors: Zhongwei Wan, Zhihao Dou, Che Liu, Yu Zhang, Dongfei Cui, Qinjian Zhao, Hui Shen, Jing Xiong, Yi Xin, Yifan Jiang, Chaofan Tao, Yangfan He, Mi Zhang, Shen Yan

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments across multiple multimodal reasoning benchmarks including MathVista, MathVision, MathVerse, and MMMU-Pro using Qwen-2.5-VL-7B and Qwen-2.5-VL-32B demonstrate that SRPO significantly outperforms state-of-the-art models, achieving notable improvements in both reasoning accuracy and reflection quality.
Researcher Affiliation | Collaboration | Zhongwei Wan (2), Zhihao Dou (3), Che Liu (4), Yu Zhang (11), Dongfei Cui (5), Qinjian Zhao (6), Hui Shen (7), Jing Xiong (10), Yi Xin (12), Yifan Jiang (8), Chaofan Tao (10), Yangfan He (9), Mi Zhang (2), Shen Yan (1). (1) ByteDance Seed; (2) The Ohio State University; (3) Case Western Reserve University; (4) Imperial College London; (5) Duke University; (6) Kean University; (7) University of Michigan; (8) University of Southern California; (9) University of Minnesota; (10) The University of Hong Kong; (11) Tongji University; (12) Nanjing University. Correspondence to: EMAIL, EMAIL
Pseudocode | No | The paper describes the methodology using mathematical equations and descriptive text, for example, in Section 3.2.1 'Group Relative Policy Optimization (GRPO)', but it does not include explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The NeurIPS checklist states 'Yes' for open access to data and code, but the justification only mentions providing 'details about our dataset (Appendix B.2), hyperparameters (Appendix B.3), and prompt templates (Appendix B.4) in Appendix B', rather than explicit code release. The project website https://srpo.pages.dev states 'Code will be released soon.'
Open Datasets | Yes | To construct the self-reflection SFT dataset for the cold-start initialization phase, we first curate samples from several established multimodal reasoning sources, including the Mulberry dataset (260K) [24], MathV360K [40], and LLaVA-CoT dataset (100K) [25]. For the subsequent reinforcement learning phase, we aggregate a diverse collection of multimodal reasoning samples from multiple datasets, such as ScienceQA [41], Geometric Math QA [42], ChartQA [43], DVQA [44], AI2D [45], MATH [46], Virgo [47], R1-OneVision [11], MMK12 [8], and PhyX [48].
Dataset Splits | No | The paper describes the collection of a refined SFT dataset of 'approximately 10K samples' and an RL training dataset, stating 'The RL training dataset consists of diverse, cross-domain reasoning samples'. It does not explicitly provide specific train/test/validation dataset splits (percentages or counts) for these constructed datasets used in their training phases.
Hardware Specification | Yes | For self-reflection cold-start SFT and subsequent RL training, Qwen2.5-VL-7B-Instruct and Qwen2.5-VL-32B-Instruct models are trained on 8 and 32 NVIDIA H100 GPUs, respectively. Inference efficiency overview. Latency on MathVista test-mini using a single H100-80G GPU with vLLM.
Software Dependencies | No | During RL, we adopt the OpenRLHF framework [61], training for 3 epochs on 30K samples with rollout and training batch sizes set to 128 (8 rollouts per sample), a sampling temperature of 1.0, and Adam optimizer with a learning rate of 1 × 10^-6. The paper mentions software components like "OpenRLHF framework" and "Adam optimizer" but does not provide specific version numbers for them.
Experiment Setup | Yes | We adopt 1 epoch for SFT to avoid overfitting. During RL, we adopt the OpenRLHF framework [61], training for 3 epochs on 30K samples with rollout and training batch sizes set to 128 (8 rollouts per sample), a sampling temperature of 1.0, and Adam optimizer with a learning rate of 1 × 10^-6. For the reflection reward parameter α, we set it to 0.1 to ensure training stability. Regarding the reflective brevity reward f_len(L_response), to discourage excessively verbose outputs, we define T_target as 2× the length of the original response (i.e., reflection plus new reasoning equals the first think length), and set T_max to 2.5× the original length (i.e., reflection plus new reasoning equals 1.5× the first think length).
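
The hyperparameters and length thresholds quoted in the Experiment Setup row can be collected in a small sketch. This is an illustration only, not the authors' code (which the page reports as unreleased): the class name, field names, and helper functions are hypothetical, and the sketch assumes that "the original response" and "the first think" refer to the same length, as the quoted parentheticals suggest. The functional form of the brevity reward is not given in the quoted text, so only the two thresholds are computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SRPOSetup:
    """Values quoted in the Experiment Setup row; all field names are hypothetical."""
    sft_epochs: int = 1
    rl_epochs: int = 3
    rl_samples: int = 30_000
    rollout_batch_size: int = 128
    train_batch_size: int = 128
    rollouts_per_sample: int = 8
    sampling_temperature: float = 1.0
    learning_rate: float = 1e-6      # Adam, 1 x 10^-6
    reflection_alpha: float = 0.1    # reflection reward parameter alpha
    target_ratio: float = 2.0        # T_target = 2 x first-think length
    max_ratio: float = 2.5           # T_max = 2.5 x first-think length


def length_thresholds(first_think_len: int,
                      setup: SRPOSetup = SRPOSetup()) -> tuple[float, float]:
    """Return (T_target, T_max) for a response whose first think has the given length."""
    return (setup.target_ratio * first_think_len,
            setup.max_ratio * first_think_len)


def reflection_budget(first_think_len: int,
                      setup: SRPOSetup = SRPOSetup()) -> tuple[float, float]:
    """Length left for reflection plus new reasoning at T_target and at T_max."""
    t_target, t_max = length_thresholds(first_think_len, setup)
    return t_target - first_think_len, t_max - first_think_len
```

For a first think of 400 tokens, `length_thresholds(400)` gives `(800.0, 1000.0)` and `reflection_budget(400)` gives `(400.0, 600.0)`, matching the quoted statement that reflection plus new reasoning equals 1× the first think length at T_target and 1.5× at T_max.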