Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning

Authors: Yana Wei, Liang Zhao, Jianjian Sun, Kangheng Lin, Jisheng Yin, Jingcheng Hu, Yinmin Zhang, En Yu, Haoran Lv, Zejia Weng, Jia Wang, Qi Han, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Vishal M. Patel

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our resulting model, Open-Vision-Reasoner (OVR), achieves state-of-the-art performance on a suite of reasoning benchmarks, including 95.3% on MATH500, 51.8% on MathVision and 54.6% on MathVerse. We release our model, data, and training dynamics to catalyze the development of more capable, behavior-aligned multimodal reasoners. Figure 1: Performance comparison with state-of-the-art models on both textual (AIME 2024, AIME 2025 [7], MATH500 [38]) and multimodal (MathVista [64], MathVision [100], MathVerse [133]) math reasoning benchmarks. Open Vision Reasoner (OVR) demonstrates superior results among open-source models and performs competitively with commercial counterparts. Section 4: Experiments. Section 4.2: Enhanced Language Reasoning and General Capabilities. Section 4.3: Superior Visual Reasoning Abilities. Section 5.1: Analysis of Training Dynamics.
Researcher Affiliation	Collaboration	Yana Wei1, Liang Zhao2, Jianjian Sun2, Kangheng Lin3, Jisheng Yin4, Jingcheng Hu5, Yinmin Zhang2, En Yu6, Haoran Lv2, Zejia Weng2, Jia Wang2, Qi Han2, Zheng Ge2, Xiangyu Zhang2, Daxin Jiang2, Vishal M. Patel1. 1Johns Hopkins University, 2StepFun, 3BUPT, 4UCAS, 5THU, 6HUST. Core contribution, Corresponding authors: EMAIL, EMAIL
Pseudocode	No	The paper describes the RL algorithm in Section 3.2, including mathematical formulas for PPO and GAE, but it does not provide a clearly labeled pseudocode block or algorithm steps in a structured format.
Open Source Code	No	We plan to publicly release all the source code and dataset upon paper acceptance.
Open Datasets	Yes	For language-only scenarios, we utilize public benchmarks including AIME (up to 2023), MATH [38], NuminaMath [52], Tulu3 MATH [48], and OpenR1-Math-220k [2], and other open-source datasets. Multimodal scenarios incorporate datasets covering geometry problem solving (Geometry3k [61], GeoQA [9], Geos [112]), visual discrimination (IconQA [62], Pixmo [21], ChartQA [67]), visual puzzles (PuzzleVQA [19], AlgoPuzzleVQA [30]), STEM (TQA [46], ScienceQA [63], K12 from [68]) and multimodal math (AtomThink [109], in-house curated math).
Dataset Splits	No	The paper mentions collecting and curating approximately 2 million cold-start data and around 300k multimodal RL data, but it does not specify how these datasets are split into training, validation, and test sets for their experiments.
Hardware Specification	Yes	We report that all experiments are conducted on NVIDIA A100 Tensor Core GPU.
Software Dependencies	No	The paper mentions using AdamW optimizer and algorithms like PPO and GAE, but does not specify version numbers for any software libraries (e.g., Python, PyTorch, CUDA, scikit-learn) required to replicate the experiments.
Experiment Setup	Yes	In the first stage of cold start, we independently fine-tune the LLM module for 5 epochs with a batch size of 640, a sequence length of 64k, and a learning rate of 2 × 10^-4 leveraging the default Qwen2.5 configuration [42]. During the subsequent stage of reinforcement learning, following Open-Reasoner Zero [40], we utilize PPO and configure GAE with γ = 1 and λ = 1 to fully capture long-term dependencies crucial for reasoning tasks, enabling stable training. This RL phase proceeds for 900 iterations, during which we adopt a curriculum for the sequence length: it begins at 24k for the first 300 iterations, increases to 32k through iteration 700, and expands to 48k thereafter.
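
The Experiment Setup row describes two mechanical pieces of the RL stage: a three-step sequence-length curriculum and GAE with γ = 1 and λ = 1. A minimal Python sketch of both follows, as an illustration and not the authors' implementation. The function names, the 1-indexed iteration convention, the reading of "through iteration 700" as inclusive, and the zero bootstrap value after the final step are all assumptions. Lengths are returned in units of "k" because the source does not say whether k means 1000 or 1024 tokens.

```python
from typing import List

TOTAL_RL_ITERATIONS = 900  # from the Experiment Setup row


def max_sequence_length_k(iteration: int) -> int:
    """Sequence-length cap (in 'k' tokens) for a 1-indexed RL iteration.

    Assumed boundaries: 1-300 -> 24k, 301-700 -> 32k, 701-900 -> 48k.
    """
    if not 1 <= iteration <= TOTAL_RL_ITERATIONS:
        raise ValueError(f"iteration must be in 1..{TOTAL_RL_ITERATIONS}")
    if iteration <= 300:
        return 24
    if iteration <= 700:
        return 32
    return 48


def gae_advantages(
    rewards: List[float],
    values: List[float],
    gamma: float = 1.0,
    lam: float = 1.0,
) -> List[float]:
    """Generalized Advantage Estimation over one finished episode.

    values[t] is the critic's estimate V(s_t). The value after the last
    step is assumed to be 0 (hypothetical: episode terminates there).
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")
    advantages = [0.0] * len(rewards)
    next_value = 0.0  # assumed terminal bootstrap value
    running = 0.0     # discounted sum of future TD residuals
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


if __name__ == "__main__":
    for it in (1, 300, 301, 700, 701, 900):
        print(it, max_sequence_length_k(it))
    # Sparse final reward, as is typical for verifiable reasoning tasks.
    print(gae_advantages([0.0, 0.0, 1.0], [0.5, 0.25, 0.75]))
```

With γ = λ = 1 the estimator telescopes to the undiscounted return-to-go minus the value baseline, so for rewards [0, 0, 1] each advantage in the sketch equals 1 - V(s_t).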