Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

Authors: Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, Wenhu Chen

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Figure 1: Performance comparison between VL-Rethinker and other SoTA models on different multimodal reasoning benchmarks. We conduct comprehensive ablations and analysis to provide insights into the effectiveness of our approach. Our approach demonstrates significant performance gains, as evidenced by the quantitative results.
Researcher Affiliation | Academia | Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, Wenhu Chen. HKUST, University of Waterloo, INF, Vector Institute. Corresponding to: EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1 Selective Sample Replay (SSR)
Open Source Code | Yes | We released code, models and our high-quality 39K dataset to support further research.
Open Datasets | Yes | Our training data was compiled by integrating publicly available datasets [Du et al., 2025, Yang et al., 2025, Meng et al., 2025] with novel data collected from the web. Our initial seed query set was constructed by aggregating publicly available multimodal datasets [Yang et al., 2025, Meng et al., 2025, Kembhavi et al., 2016, Saikh et al., 2022, Du et al., 2025] with novel queries gathered from the web.
Dataset Splits | No | The paper mentions subsets of training data (16,000 queries for the 7B model, 20,000 queries for the 32B and 72B models) and the use of a held-out validation set, but it does not specify explicit train/test/validation splits (e.g., percentages or exact counts) for the evaluation benchmarks or for the overall dataset of 38,870 queries.
Hardware Specification | Yes | Our VL-Rethinker-72B was trained using OpenRLHF for a maximum of 3 epochs on 8 sets of 8 A800(80G) for approximately 60 hours.
Software Dependencies | No | The paper mentions 'OpenRLHF' for training and 'Qwen2.5-VL-72B' as a base model, but it does not provide specific version numbers for any software libraries, frameworks (like PyTorch or TensorFlow), or programming languages.
Experiment Setup | Yes | Our VL-Rethinker-72B was trained using OpenRLHF for a maximum of 3 epochs... We employed a near on-policy RL paradigm, where the behavior policy was synchronized with the improvement policy after every 1024 queries, which we define as an episode. The replay buffer for SSR persisted for the duration of each episode before being cleared. For each query, we sampled 8 responses. The training batch size was set to 512 query-response pairs. We accept at most two correct rethinking trajectories for each query. We set the priority hyperparameter in SSR to α = 1.0 in the experiments.
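
The numbers quoted in the Hardware Specification and Experiment Setup entries above imply a few derived quantities that are easy to misread from prose. The following Python sketch collects the quoted values and computes them. It is illustrative only: the class, field, and function names are hypothetical and do not come from the paper or from the training framework. The batch count assumes every sampled query-response pair is used exactly once, which the replay mechanism may change. The `priority_sample` helper is a generic priority-weighted draw (probability proportional to priority raised to α). It is an assumption about how a priority hyperparameter such as α is commonly used, not the paper's Algorithm 1.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the table above; all field names are hypothetical."""
    queries_per_episode: int = 1024   # policy sync interval, in queries
    responses_per_query: int = 8      # responses sampled per query
    train_batch_size: int = 512       # query-response pairs per batch
    gpu_sets: int = 8                 # "8 sets of 8 A800(80G)"
    gpus_per_set: int = 8
    wall_clock_hours: float = 60.0    # approximate figure for the 72B run
    ssr_alpha: float = 1.0            # SSR priority hyperparameter


def derived_quantities(cfg: ReportedSetup) -> dict:
    """Arithmetic implied by the quoted setup (not stated in the source)."""
    pairs = cfg.queries_per_episode * cfg.responses_per_query
    total_gpus = cfg.gpu_sets * cfg.gpus_per_set
    return {
        "pairs_per_episode": pairs,
        # Assumes each sampled pair is trained on exactly once.
        "batches_per_episode": pairs // cfg.train_batch_size,
        "total_gpus": total_gpus,
        "approx_gpu_hours": total_gpus * cfg.wall_clock_hours,
    }


def priority_sample(priorities, k, alpha, rng):
    """Draw k indices with probability proportional to priority ** alpha.

    Assumption: priorities are non-negative numbers (for example, some
    per-sample importance score). Sampling is with replacement.
    """
    weights = [p ** alpha for p in priorities]
    if sum(weights) <= 0:
        raise ValueError("at least one priority must be positive")
    return rng.choices(range(len(priorities)), weights=weights, k=k)


if __name__ == "__main__":
    cfg = ReportedSetup()
    for name, value in derived_quantities(cfg).items():
        print(f"{name}: {value}")
    # Toy priorities: zero-priority items are never drawn when alpha > 0.
    print(priority_sample([0.0, 0.5, 2.0], k=4, alpha=cfg.ssr_alpha,
                          rng=random.Random(0)))
```

With the quoted values this gives 8192 query-response pairs per episode, 16 batches of 512 per episode, 64 GPUs in total, and roughly 3840 GPU-hours for the 72B run.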