Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Grounded Reinforcement Learning for Visual Reasoning

Authors: Gabriel Sarch, Snigdha Saha, Naitik Khandelwal, Ayush Jain, Michael J. Tarr, Aviral Kumar, Katerina Fragkiadaki

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate ViGoRL across a suite of visual reasoning benchmarks, including SAT-2 [48], BLINK [15], RoboSpatial [53], ScreenSpot [10, 33], VisualWebArena [31], and V* Bench [66]. Our approach consistently outperforms existing methods on all tasks. Specifically, ViGoRL achieves substantial improvements over vanilla GRPO, with accuracy gains of 12.9 points on SAT-2 and 2.0 points on BLINK.
Researcher Affiliation | Academia | Gabriel Sarch, Snigdha Saha, Naitik Khandelwal, Ayush Jain, Michael J. Tarr, Aviral Kumar, Katerina Fragkiadaki; Carnegie Mellon University
Pseudocode | Yes | Algorithm Phases. MCTS operates via the standard four-phase loop: Selection: Starting at the root, the search follows a path through children using the UCB policy... Expansion: At each expandable node, we generate up to three children using the VLM... Rollout: For each new child, we simulate reasoning steps using the VLM... Backpropagation: Final 0/1 rewards are backpropagated up the tree along the visited path...
Open Source Code | Yes | Justification: We include anonymized links in the supplemental material with all code, trained models, and instructions to reproduce experiments, as well as generated datasets used in our experiments.
Open Datasets | Yes | We evaluate ViGoRL across a suite of visual reasoning benchmarks, including SAT-2 [48], BLINK [15], RoboSpatial [53], ScreenSpot [10, 33], VisualWebArena [31], and V* Bench [66]... For web grounding, we draw 12k (screenshot, referring expression, box) examples from OS-ATLAS [69]... We use ICAL [49], a dataset of 92 web navigation trajectories... The visual search benchmark V* Bench tests fine-grained visual understanding using 191 high-resolution images from the SA-1B dataset [30].
Dataset Splits | Yes | For spatial reasoning, we use SAT-2 [48], sampling 32k training and 1k validation examples. The model is tasked with selecting the correct textual option, with randomized answer order to reduce position bias. For web grounding, we draw 12k (screenshot, referring expression, box) examples from OS-ATLAS [69] (4k each from mobile, web, and desktop), plus 1.5k warm start and 1.5k validation samples evenly split by domain.
Hardware Specification | Yes | Training is conducted on 8 A100 GPUs with Qwen2.5-VL models (3B and 7B).
Software Dependencies | Yes | We build on Llama-Factory [88] for SFT, and EasyR1 [89, 52] for GRPO.
Experiment Setup | Yes | Supervised fine-tuning uses 3 epochs, while GRPO is applied for 500 rollout-update iterations. Evaluation uses decoding temperature of 0.5. We build on Llama-Factory [88] for SFT, and EasyR1 [89, 52] for GRPO. Table A1: Supervised Fine Tuning (SFT). Table A2: GRPO Training. Table A3: Multi-turn GRPO.
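
The four-phase MCTS loop quoted in the Pseudocode row can be illustrated with a minimal sketch. This is not the authors' implementation: the VLM calls are replaced by hypothetical toy functions (`propose`, `simulate`, `is_terminal`) that operate on tuples of digits, and the exploration constant `c = 1.4` is an assumed value. Only the overall shape follows the quoted description: UCB selection, up to three children per expansion, a rollout for each new child, and 0/1 rewards backpropagated along the visited path.

```python
import math
import random
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

State = Tuple[int, ...]  # toy stand-in for a partial reasoning trace


@dataclass(eq=False)
class Node:
    state: State
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value: float = 0.0  # sum of backpropagated 0/1 rewards


def ucb(child: Node, c: float = 1.4) -> float:
    """UCB score; c is an assumed exploration constant."""
    if child.visits == 0:
        return float("inf")
    exploit = child.value / child.visits
    explore = c * math.sqrt(math.log(child.parent.visits) / child.visits)
    return exploit + explore


def backpropagate(node: Optional[Node], reward: float) -> None:
    """Add the reward to every node on the path back to the root."""
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent


# --- Hypothetical toy stand-ins for the VLM calls ---
def is_terminal(state: State) -> bool:
    return len(state) >= 3


def propose(state: State, rng: random.Random) -> List[State]:
    """Candidate next steps: append one of the digits 1, 2, 3."""
    return [state + (d,) for d in (1, 2, 3)]


def simulate(state: State, rng: random.Random) -> float:
    """Randomly complete the trace, then return a 0/1 reward."""
    while not is_terminal(state):
        state = state + (rng.choice((1, 2, 3)),)
    return 1.0 if sum(state) >= 7 else 0.0


def mcts(root_state: State, iterations: int = 30,
         max_children: int = 3, seed: int = 0) -> Node:
    rng = random.Random(seed)
    root = Node(root_state)
    for _ in range(iterations):
        # Selection: follow the highest-UCB child down to a leaf.
        node = root
        while node.children:
            node = max(node.children, key=ucb)
        # Expansion: add up to max_children children to a non-terminal leaf.
        if not is_terminal(node.state):
            for s in propose(node.state, rng)[:max_children]:
                node.children.append(Node(s, parent=node))
        # Rollout and backpropagation for each new child
        # (or for the leaf itself if it is terminal).
        for leaf in (node.children or [node]):
            backpropagate(leaf, simulate(leaf.state, rng))
    return root


if __name__ == "__main__":
    tree = mcts(())
    for child in tree.children:
        print(child.state, child.visits, child.value)
```

In a real setting, `propose` and `simulate` would each wrap a call to the model, and the terminal reward would come from checking the final answer; those details are elided in the quoted text and are not reconstructed here.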