Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task

Authors: Sunqi Fan, Jiashuo Cui, Meng-Hao Guo, Shuojin Yang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments across multiple Video QA benchmarks, we demonstrate that augmenting GPT-4o with lightweight Video Toolkit, results in an 8.2% gain on Video-MME [12] and 4.6% on LongVideoBench [77].
Researcher Affiliation | Academia | Sunqi Fan, Jiashuo Cui, Meng-Hao Guo, Shuojin Yang BNRist, Department of Computer Science and Technology, Tsinghua University EMAIL, EMAIL
Pseudocode | Yes | We conclude this algorithmic procedure in Figure 2 and Algorithm 1. Algorithm 1 Spatiotemporal Reasoning (STAR)
Open Source Code | Yes | The code is publicly available at https://github.com/fansunqi/VideoTool.
Open Datasets | Yes | We evaluate our method on four widely used Video QA datasets: Video-MME [12], NExT-QA [78], LongVideoBench [77], and EgoSchema [44].
Dataset Splits | Yes | NExT-QA [78] dataset comprises 5,440 videos and approximately 52,000 human-annotated QA pairs. It is specifically curated to benchmark models' capabilities in understanding the causal structures and temporal dependencies inherent in complex video events. In this work, we adopt the multiple-choice question answering subset of NExT-QA, which includes 34K training, 5K validation, and 9K testing samples.
Hardware Specification | Yes | Our STAR framework can run on one NVIDIA RTX 4090 GPU, while the variant STAR-MINI framework can run on a personal computer like MAC.
Software Dependencies | Yes | STAR is the full version, employing tools based on open-source models with up to 3B parameters (e.g., QwenVL-2.5-3B [2]) and using GPT-4o-2024-08-06 [50] APIs.
Experiment Setup | Yes | For videos longer than 16 seconds, we extract 16 initial frames uniformly; for videos with a duration of 16 seconds or less, initial frames are extracted at a rate of 1 fps.
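
The initial-frame rule quoted under Experiment Setup can be sketched as a small function that returns the timestamps (in seconds) at which frames would be sampled. This is only an illustration, not the authors' code: the function name, its parameters, and the choice to place uniform samples at segment midpoints are assumptions, since the quoted text does not say exactly where within the video the 16 uniform frames fall.

```python
def initial_frame_timestamps(duration_s, num_uniform=16, threshold_s=16.0, short_fps=1.0):
    """Return sampling timestamps (seconds) for the initial frames.

    Hypothetical helper: the name, defaults, and midpoint placement are
    assumptions made for illustration, not taken from the paper's code.
    """
    if duration_s <= 0:
        return []
    if duration_s > threshold_s:
        # Long video: split into num_uniform equal segments and
        # take the midpoint of each (assumed placement).
        step = duration_s / num_uniform
        return [(i + 0.5) * step for i in range(num_uniform)]
    # Short video (threshold_s or less): sample at short_fps, starting at t = 0.
    count = max(1, int(duration_s * short_fps))
    return [i / short_fps for i in range(count)]
```

Under these assumptions, a 32-second video yields 16 timestamps spaced 2 seconds apart, while a 10-second video yields one timestamp per second, 10 in total.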