Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

VLM-R³: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought

Authors: Chaoya Jiang, Yongrui Heng, Wei Ye, Haiyang Xu, Ming Yan, Ji Zhang, Fei Huang, Shikun Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments on MathVista, ScienceQA, and other benchmarks show that VLM-R3 sets a new state of the art in zero-shot and few-shot settings, with the largest gains appearing on questions demanding subtle spatial reasoning or fine-grained visual cue extraction.
Researcher Affiliation Collaboration (1) National Engineering Research Center for Software Engineering, Peking University; (2) Alibaba Group; (3) ZEEKR Intelligent Technology Holding Limited
Pseudocode No The paper describes the method and training paradigms using textual explanations and mathematical equations, but it does not include a distinct block labeled 'Pseudocode' or 'Algorithm'.
Open Source Code No The paper does not provide an explicit statement about releasing code for VLM-R3 or a direct link to a code repository. While it builds upon Qwen2.5-VL which is open source, it does not state that *their* implementation of VLM-R3 is open-source.
Open Datasets Yes First, we introduce Visuo-Lingual Interleaved Rationale (VLIR), a pioneering dataset meticulously curated to support the development of MLLMs for interleaved text-image CoT reasoning. VLIR provides explicit annotations for visual region localization, image cropping instructions, and semantic enhancement cues, all embedded within multi-step reasoning narratives. We select data from a diverse set of existing benchmarks to cover a wide range of visual reasoning challenges: TextVQA [48], DocVQA [33] for tasks requiring OCR and document structure understanding. General Visual Question Answering: GQA [17] for complex multistep reasoning over visual scenes. Chart and Infographic Interpretation: InfographicsVQA [32] for understanding structured visual data. Spatial Relation Reasoning: VSR [26] for tasks focused on identifying and reasoning about spatial relationships between objects.
Dataset Splits No Our supervised fine-tuning experiments used the complete VLIR dataset. Our experiments were conducted on 4 NVIDIA A100 GPUs, each equipped with 80GB of memory, leveraging DeepSpeed [42] for efficient training. We used a batch size of 2 with a gradient accumulation of 8, a learning rate of 2 × 10⁻⁷, and trained for 3 epochs. During this phase, the vision encoder and MLP projector were frozen, and only the Large Language Model (LLM) component was trained. For the R-GRPO stage, we sampled approximately 5,000 data points from TextVQA [48], GQA [17], VSR [26], DocVQA [33] and M3CoT [9] datasets.
Hardware Specification Yes Our experiments were conducted on 4 NVIDIA A100 GPUs, each equipped with 80GB of memory, leveraging DeepSpeed [42] for efficient training. Our experiments for R-GRPO were performed on 6 NVIDIA A100 GPUs, each with 80GB of memory, also utilizing DeepSpeed [42].
Software Dependencies No Our experiments were conducted on 4 NVIDIA A100 GPUs, each equipped with 80GB of memory, leveraging DeepSpeed [42] for efficient training. Our experiments for R-GRPO were performed on 6 NVIDIA A100 GPUs, each with 80GB of memory, also utilizing DeepSpeed [42].
Experiment Setup Yes We used a batch size of 2 with a gradient accumulation of 8, a learning rate of 2 × 10⁻⁷, and trained for 3 epochs. During this phase, the vision encoder and MLP projector were frozen, and only the Large Language Model (LLM) component was trained. For the R-GRPO stage, we sampled approximately 5,000 data points from TextVQA [48], GQA [17], VSR [26], DocVQA [33] and M3CoT [9] datasets. Regarding the hyperparameters for the GRPO formulation (2), we set M = 5. Following the experience of related studies, we set β = 0.0, i.e., we eliminate the KL divergence constraint. The batch size per device was set to 1, with a gradient accumulation of 16. The learning rate was 1 × 10⁻⁶, and training continued for 300 steps. We employ a rule-based reinforcement learning approach, where the correctness of the final answer was judged using an exact match criterion. Similar to the supervised fine-tuning stage, the vision encoder and MLP projector were frozen, and only the LLM component was trained.
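
The reported setup can be made concrete with a small sketch. The Python below is illustrative only and is not the authors' code. It assumes plain data parallelism, and that the fine-tuning batch size of 2 is per device (the quoted text does not say so). The string normalization in the exact-match check and the mean/standard-deviation normalization of rewards within a group of M = 5 samples are also assumptions. The function names are hypothetical, and the paper's full reward may contain terms not quoted here.

```python
from statistics import mean, pstdev


def effective_batch_size(per_device_batch, grad_accum_steps, num_gpus):
    """Samples contributing to one optimizer step under plain data parallelism."""
    return per_device_batch * grad_accum_steps * num_gpus


def exact_match_reward(prediction, reference):
    """Rule-based reward: 1.0 on an exact match, else 0.0.

    Stripping whitespace and lowercasing are assumptions, not from the paper.
    """
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group (GRPO-style, assumed form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Supervised fine-tuning: batch 2 (assumed per device), accumulation 8, 4 GPUs.
sft_batch = effective_batch_size(2, 8, 4)

# R-GRPO: batch 1 per device, accumulation 16, 6 GPUs.
rgrpo_batch = effective_batch_size(1, 16, 6)

# One group of M = 5 hypothetical sampled answers to a single question.
samples = ["42", " 42 ", "41", "forty-two", "42"]
rewards = [exact_match_reward(s, "42") for s in samples]
advantages = group_relative_advantages(rewards)

print(sft_batch, rgrpo_batch, rewards)
```

With β = 0.0 there is no KL penalty, so in this sketch the group-normalized advantages are the only learning signal: correct answers receive positive advantages and incorrect ones negative.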