Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

ROVER: Recursive Reasoning Over Videos with Vision-Language Models for Embodied Tasks

Authors: Philip Schroeder, Ondrej Biza, Thomas Weng, Hongyin Luo, Jim Glass

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate ROVER, implemented using an in-context learning approach, on diverse Open X-Embodiment videos and on a new dataset derived from RoboCasa that consists of 543 videos showing both expert and perturbed non-expert trajectories across 27 robotic manipulation tasks. ROVER outperforms strong baselines across three video reasoning tasks: task progress estimation, frame-level natural language reasoning, and video question answering.
Researcher Affiliation | Collaboration | Philip Schroeder¹, Ondrej Biza², Thomas Weng², Hongyin Luo¹, James Glass¹; ¹MIT CSAIL, ²RAI Institute
Pseudocode | Yes | Algorithm 1 ROVER
Open Source Code | Yes | Demos, code, and data available at: https://rover-vlm.github.io
Open Datasets | Yes | We evaluate ROVER, implemented using an in-context learning approach, on diverse Open X-Embodiment videos and on a new dataset derived from RoboCasa that consists of 543 videos showing both expert and perturbed non-expert trajectories across 27 robotic manipulation tasks. Demos, code, and data available at: https://rover-vlm.github.io
Dataset Splits | Yes | Our generated evaluation dataset, comprising trajectories that exhibit a wide range of task expertise, includes 543 videos across 27 tasks (Appendix B), each collected in a random kitchen scene. The videos are separated into levels based on the amount of the task completed during the video (Appendix B). The highest-level videos in each task group show full task completion with near-expert behavior.
Hardware Specification | Yes | For the open source model, all experiments were performed on A6000 GPUs.
Software Dependencies | Yes | We use gemini-1.5-pro for Gemini-1.5-Pro and gemini-2.5-pro-preview for Gemini-2.5-Pro-Preview from Google's Gemini API. We use gpt-4o from the OpenAI API for GPT-4o. We use Qwen2.5-VL-32B-Instruct for Qwen2.5-VL-32B available on Hugging Face. For the open source model, all experiments were performed on A6000 GPUs.
Experiment Setup | Yes | We implement GVL using the following prompt from [31]. Similar to GVL, we implement ROVER using an in-context learning approach, with the same prompting, shown below, used for all backbone VLMs.
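
For anyone scripting a rerun, the model identifiers quoted in the Software Dependencies row can be gathered into a small lookup table. The following is a minimal sketch only: the identifiers are copied from that row, while the `BACKBONES` dictionary, the `source` labels, and the `model_id` helper are hypothetical names introduced here for illustration and do not come from the paper or its released code.

```python
# Model identifiers exactly as quoted in the Software Dependencies row.
# BACKBONES, the "source" labels, and model_id() are illustrative
# assumptions, not part of the paper or its released code.
BACKBONES = {
    "Gemini-1.5-Pro": {"source": "Google Gemini API", "model_id": "gemini-1.5-pro"},
    "Gemini-2.5-Pro-Preview": {"source": "Google Gemini API", "model_id": "gemini-2.5-pro-preview"},
    "GPT-4o": {"source": "OpenAI API", "model_id": "gpt-4o"},
    "Qwen2.5-VL-32B": {"source": "Hugging Face", "model_id": "Qwen2.5-VL-32B-Instruct"},
}


def model_id(display_name: str) -> str:
    """Return the identifier listed for a backbone's display name."""
    try:
        return BACKBONES[display_name]["model_id"]
    except KeyError:
        raise ValueError(f"Unknown backbone: {display_name!r}") from None


if __name__ == "__main__":
    # Print one line per backbone: display name, identifier, and source.
    for name, info in BACKBONES.items():
        print(f"{name}: {info['model_id']} ({info['source']})")
```

Keeping the identifiers in one place makes it easier to record exactly which backbone produced a given result, which matters here because the hosted models are addressed by name only.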