Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Semi-off-Policy Reinforcement Learning for Vision-Language Slow-Thinking Reasoning

Authors: Junhao Shen, Haiteng Zhao, Yuzhe Gu, Songyang Gao, Kuikun Liu, Haian Huang, Jianfei Gao, Dahua Lin, Wenwei Zhang, Kai Chen

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments with InternVL2.5 and InternVL3.0 with 8B and 38B sizes show the effectiveness of SOPHIA. Notably, SOPHIA improves InternVL3.0-38B by 8.50% in average, reaching state-of-the-art performance among open-source LVLMs on multiple multimodal reasoning benchmarks, and even outperforms some closed-source models (e.g., GPT-4.1) on the challenging MathVision and OlympiadBench, achieving 49.08% and 49.95% pass@1 accuracy, respectively. Analysis shows SOPHIA outperforms supervised fine-tuning and direct on-policy RL methods, offering a better policy initialization for further on-policy training.
Researcher Affiliation | Collaboration | (1) Shanghai Jiao Tong University; (2) Shanghai AI Laboratory; (3) MMLab, The Chinese University of Hong Kong
Pseudocode | Yes | Algorithm 1: The Semi-off-Policy Reinforcement Learning Algorithm. 1: Inputs: Training dataset Dtrain, policy model (LVLM) πθ, reasoning language model π, number of rollouts per image K, number of rollouts per caption N. 2: Initialize policy πθ and π as Section 5.1. 3: for I ∈ Dtrain do 4: Obtain its image tokens v. 5: Generate K caption samples {c(1), c(2), …, c(K)} where y(k) ∼ πi(· | xd, v). 6: for k = 1, 2, …, K do 7: Construct prompt x̂q based on c(k), xq. 8: Sample N reasoning trajectories {y(1), y(2), …, y(N)} where y(n) ∼ π(· | x̂qi). 9: Calculate the reward of each trajectory with Eq. (4). 10: end for 11: Calculate the reward of each caption with Eq. (5). 12: end for 13: Construct the off-policy dataset D with their reward based on Eq. (6). 14: Update πθ with Eq. (7). 15: Return: The optimized policy model πθ.
Open Source Code | No | We will release the open-source model and training script immediately this paper is accepted.
Open Datasets | Yes | The warm-up training data consists of the open-source RLAIF-V [60] and WildVision [61] datasets. For RLAIF-V, we select the accepted samples, and for WildVision, we choose the best response. ... We also analyze SOPHIA on open-source training dataset MathV360K [7] in Section 6.4.
Dataset Splits | Yes | MMMU [64]: the validation split of MMMU dataset. MMMU Pro [64]: the vision split of MMMU Pro dataset, where the input image contains both the visual content and the question, while the text query excludes the question. MathVista [65]: the testmini split of MathVista dataset. MathVerse [66]: the testmini and vision-only split of MathVerse dataset. DynaMath [67]: the full test set of DynaMath dataset. MathVision [34]: the full test set of MathVision dataset. MV-MATH [68]: the full test set of MV-MATH dataset. OlympiadBench [35]: the full test set of OlympiadBench dataset, including visual and textual questions and excluding all proof questions. ... Specifically, using the same settings as with the private dataset (80K), we train InternVL2.5-38B on 10% (36K) and 50% (180K) subsets of MathV360K to explore the trade-off between data quality and quantity.
Hardware Specification | No | The paper discusses computational costs and inference speed, but does not provide specific hardware details like CPU/GPU models or memory. For example, Appendix B.3 states: 'During the full training process, the computational cost of training the InternVL3.0-38B model serves as the baseline.' and 'Empirically, the inference process of SOPHIA is about 3-4 times slower than that of the base LVLM on standard reasoning benchmarks.' but no specific hardware models are named.
Software Dependencies | No | The paper mentions 'AdamW optimizer' but does not provide specific version numbers for this or any other software components used in the experiments.
Experiment Setup | Yes | Hyperparameters. During reasoning-rewarded sampling, we set K = N = 8. The max length of each caption and rollout trajectory is 32678 tokens. During policy updating, the threshold of caption reward α = 0.75, and only the weights of language backbone of the LVLM are unfrozen. The policy model is trained with batch size of 512, learning rate of 2 × 10^-5, weight decay of 0.05 and employs a cosine annealing learning rate schedule, decaying to 1/4 of the initial learning rate over time. We optimize the policy model using the AdamW optimizer. ... Training Details. We adopt Qwen2.5-72B-Instruct as the generative verifier and use a binary outcome reward signal for GRPO. During training, each batch contains 64 questions, with 8 rollouts per question and a maximum trajectory length of 16384 tokens. The correctness scores across rollouts are averaged to compute a pass rate; questions with a pass rate of exactly 0 or 1 are excluded, using 0.5 as the threshold for incorrectness. The policy model is trained with a learning rate of 5 × 10^-7, with all other settings consistent with Appendix B.
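
The sampling loop quoted in the Pseudocode row (steps 3 to 12 of Algorithm 1) can be sketched roughly as follows. This is only a structural illustration, not the authors' implementation: the report does not reproduce Eqs. (4) to (7), so the trajectory reward and caption reward are passed in as placeholder callables, and the function name, the dictionary keys, and the prompt template are all hypothetical. Dataset construction (Eq. (6)) and the policy update (Eq. (7)) are deliberately left out.

```python
from typing import Any, Callable, Dict, Iterable, List, Sequence


def collect_rewarded_rollouts(
    dataset: Iterable[Dict[str, Any]],
    sample_caption: Callable[[Any], str],
    sample_reasoning: Callable[[str], str],
    trajectory_reward: Callable[[str, Dict[str, Any]], float],
    caption_reward: Callable[[Sequence[float]], float],
    num_captions: int = 8,   # K in the quoted setup
    num_rollouts: int = 8,   # N in the quoted setup
) -> List[Dict[str, Any]]:
    """Collect captions, reasoning trajectories, and their rewards.

    All callables are hypothetical stand-ins:
      sample_caption    -> the LVLM policy producing a caption for an image
      sample_reasoning  -> the reasoning language model answering a text prompt
      trajectory_reward -> placeholder for Eq. (4), not given in this report
      caption_reward    -> placeholder for Eq. (5), not given in this report
    """
    records: List[Dict[str, Any]] = []
    for item in dataset:                                   # step 3
        for _ in range(num_captions):                      # steps 5-6
            caption = sample_caption(item["image"])
            # Step 7: the real prompt template is not given; this one is assumed.
            prompt = f"{caption}\n\n{item['question']}"
            trajectories = [sample_reasoning(prompt) for _ in range(num_rollouts)]  # step 8
            rewards = [trajectory_reward(t, item) for t in trajectories]            # step 9
            records.append(
                {
                    "item": item,
                    "caption": caption,
                    "trajectories": trajectories,
                    "trajectory_rewards": rewards,
                    "caption_reward": caption_reward(rewards),                      # step 11
                }
            )
    return records
```

Passing the reward functions in as arguments keeps the sketch honest about what the report leaves unspecified; swapping in the paper's actual equations would not change the loop structure.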
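
To make the learning-rate schedule in the Experiment Setup row concrete, here is a minimal sketch of cosine annealing from the quoted initial rate of 2 × 10^-5 down to 1/4 of that value. The function name, the absence of warm-up, and the `total_steps` argument are assumptions; the report does not state the number of training steps or whether warm-up was used.

```python
import math


def cosine_lr(step: int, total_steps: int,
              base_lr: float = 2e-5, final_ratio: float = 0.25) -> float:
    """Cosine-annealed learning rate that ends at final_ratio * base_lr.

    Assumes no warm-up phase (not stated in the report).
    """
    min_lr = base_lr * final_ratio
    # Clamp progress to [0, 1] so steps past the end stay at the floor.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate starts at 2 × 10^-5, passes through 1.25 × 10^-5 at the halfway point, and ends at 5 × 10^-6.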
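
The pass-rate filter described under Training Details can be illustrated with a short sketch. It assumes the verifier returns a score in [0, 1] per rollout and that a score of exactly 0.5 counts as correct; the report only says that 0.5 is the threshold for incorrectness, so the boundary handling, the function name, and the input layout are hypothetical.

```python
from typing import Dict, List


def filter_by_pass_rate(verifier_scores: Dict[str, List[float]],
                        threshold: float = 0.5) -> Dict[str, float]:
    """Keep only questions whose rollouts are neither all correct nor all wrong.

    verifier_scores maps a question id to one verifier score per rollout.
    Scores below the threshold are treated as incorrect (binary reward 0);
    treating a score equal to the threshold as correct is an assumption.
    Returns the pass rate of each retained question.
    """
    kept: Dict[str, float] = {}
    for question_id, scores in verifier_scores.items():
        rewards = [1 if s >= threshold else 0 for s in scores]
        pass_rate = sum(rewards) / len(rewards)
        if 0.0 < pass_rate < 1.0:  # drop pass rates of exactly 0 or 1
            kept[question_id] = pass_rate
    return kept
```

One plausible motivation for the filter: GRPO normalizes rewards within each group of rollouts, so a question on which every rollout receives the same reward yields zero advantage and contributes no learning signal.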