Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles

Authors: Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, Kai-Wei Chang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental OpenVLThinker-7B consistently advances performance across six benchmarks demanding mathematical and general reasoning, notably improving MathVista by 3.8%, EMMA by 2.4%, and HallusionBench by 1.6%. Our analysis reveals that the base model rarely exhibits reasoning behaviors initially, but SFT effectively surfaces these latent actions and narrows the RL search space, accelerating the development of reasoning capabilities. Each subsequent RL stage further refines the model's reasoning skills, producing higher-quality SFT data for continued self-improvement.
Researcher Affiliation Academia Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, Kai-Wei Chang; University of California, Los Angeles
Pseudocode No The paper includes diagrams (e.g., Figure 2 and Figure 5) to illustrate processes but does not present any explicitly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code Yes The code, model and data are held at https://github.com/yihedeng9/OpenVLThinker.
Open Datasets Yes We employ six established benchmarks to examine model's ability thoroughly: Math reasoning: MathVista [44], MathVerse [88] and MathVision [69]. The three benchmarks evaluate how LVLMs interpret and reason with diagrams in visual math problems through both multiple-choice and free-form questions. General reasoning: MMMU-Pro [82] and EMMA [22]. MMMU-Pro spans 30 subjects across 183 subfields, including business, medicine, and science. EMMA evaluates in physics, chemistry, coding, and math. Perception: HallusionBench [19], designed to evaluate LVLMs' susceptibility to language hallucination and visual illusion. We source our training data from the established LLaVA-OneVision [34] and specifically consider the 14 data sources in overlap with MathV360K [58] (Table 4). Based on our preliminary experiments, we equally draw 500 examples from each source to form the SFT seed dataset of 7K examples.
Dataset Splits Yes We source our training data from the established LLaVA-OneVision [34] and specifically consider the 14 data sources in overlap with MathV360K [58] (Table 4). Based on our preliminary experiments, we equally draw 500 examples from each source to form the SFT seed dataset of 7K examples, where for each iteration we collect distillation data via rejection sampling, resulting in a final 3K SFT data. We then classify the data sources into easy, medium and hard (as detailed in Table 4). We construct the 3K medium-level RL training data from the 5 sources that we identified as medium difficulty. Finally, we construct 6K hard-level RL training data from the 3 most difficult sources, summing up to 12K data in total for each iteration that trains from the base model.
Hardware Specification Yes The experiments were conducted on an 8×H100 (or equivalent) GPU node. Experiments were conducted on GPU clusters to the similar level of NVIDIA H100 80GB GPU.
Software Dependencies No Our training framework is based on LLaMA-Factory for SFT and EasyR1 for RL. While these frameworks are mentioned, specific version numbers for them or other key software components are not provided.
Experiment Setup Yes Training setup. We take Qwen2.5-VL-7B [3] as the base model and perform three iterations of the SFT-RL cycle as illustrated in Section 4, applying full fine-tuning for both SFT and RL. Our training framework is based on LLaMA-Factory for SFT and EasyR1 for RL. We source our training data from the established LLaVA-OneVision [34] and specifically consider the 14 data sources in overlap with MathV360K [58] (Table 4). Based on our preliminary experiments, we equally draw 500 examples from each source to form the SFT seed dataset of 7K examples, where for each iteration we collect distillation data via rejection sampling, resulting in a final 3K SFT data. We then classify the data sources into easy, medium and hard (as detailed in Table 4). We construct the 3K medium-level RL training data from the 5 sources that we identified as medium difficulty. Finally, we construct 6K hard-level RL training data from the 3 most difficult sources, summing up to 12K data in total for each iteration that trains from the base model. We defer the training hyperparameters to Appendix C. Appendix C, Table 13 (Supervised fine-tuning hyperparameters) and Table 14 (GRPO hyperparameters) list specific values for Learning rate, Global batch size, Scheduler, Warmup ratio, Num train epochs, Max grad norm, Weight decay, Rollout temperature, and Image max pixels.
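
Illustrative sketch (not from the paper): the data budget quoted in the Dataset Splits and Experiment Setup rows can be checked with a few lines of Python. The code below is a minimal sketch under stated assumptions, not the authors' pipeline. The function names `draw_seed_set` and `rejection_sample`, and the record fields `source`, `prediction`, and `answer`, are hypothetical. The exact-match filter stands in for whatever acceptance criterion the authors actually use for rejection sampling, which the quoted text does not specify.

```python
import random
from collections import defaultdict

# Figures quoted in the rows above.
NUM_SOURCES = 14
PER_SOURCE = 500
PER_ITERATION = {"sft": 3_000, "rl_medium": 3_000, "rl_hard": 6_000}


def draw_seed_set(examples, per_source=PER_SOURCE, seed=0):
    """Draw the same number of examples from every data source.

    `examples` is a list of dicts with a hypothetical "source" field.
    """
    rng = random.Random(seed)
    by_source = defaultdict(list)
    for example in examples:
        by_source[example["source"]].append(example)

    seed_set = []
    for source in sorted(by_source):
        pool = by_source[source]
        if len(pool) < per_source:
            raise ValueError(f"source {source!r} has only {len(pool)} examples")
        # Sample without replacement within each source.
        seed_set.extend(rng.sample(pool, per_source))
    return seed_set


def rejection_sample(candidates, budget):
    """Keep candidates whose prediction matches the reference answer.

    Exact string match is an assumption made for this sketch; at most
    `budget` accepted candidates are returned, in input order.
    """
    kept = [c for c in candidates if c["prediction"] == c["answer"]]
    return kept[:budget]
```

With 14 sources and 500 draws each, `draw_seed_set` returns 7,000 seed examples, and the per-iteration budget in `PER_ITERATION` sums to the 12K total stated above (3K SFT + 3K medium RL + 6K hard RL).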