Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task
Authors: Sunqi Fan, Jiashuo Cui, Meng-Hao Guo, Shuojin Yang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments across multiple Video QA benchmarks, we demonstrate that augmenting GPT-4o with lightweight Video Toolkit, results in an 8.2% gain on Video-MME [12] and 4.6% on LongVideoBench [77]. |
| Researcher Affiliation | Academia | Sunqi Fan, Jiashuo Cui, Meng-Hao Guo, Shuojin Yang BNRist, Department of Computer Science and Technology, Tsinghua University EMAIL, EMAIL |
| Pseudocode | Yes | We conclude this algorithmic procedure in Figure 2 and Algorithm 1. Algorithm 1 Spatiotemporal Reasoning (STAR) |
| Open Source Code | Yes | The code is publicly available at https://github.com/fansunqi/VideoTool. |
| Open Datasets | Yes | We evaluate our method on four widely used Video QA datasets: Video-MME [12], NExT-QA [78], LongVideoBench [77], and EgoSchema [44]. |
| Dataset Splits | Yes | NExT-QA [78] dataset comprises 5,440 videos and approximately 52,000 human-annotated QA pairs. It is specifically curated to benchmark models' capabilities in understanding the causal structures and temporal dependencies inherent in complex video events. In this work, we adopt the multiple-choice question answering subset of NExT-QA, which includes 34K training, 5K validation, and 9K testing samples. |
| Hardware Specification | Yes | Our STAR framework can run on one NVIDIA RTX 4090 GPU, while the variant STAR-MINI framework can run on a personal computer like MAC. |
| Software Dependencies | Yes | STAR is the full version, employing tools based on open-source models with up to 3B parameters (e.g., Qwen VL-2.5-3B [2]) and using GPT-4o-2024-08-06 [50] APIs. |
| Experiment Setup | Yes | For videos longer than 16 seconds, we extract 16 initial frames uniformly; for videos with a duration of 16 seconds or less, initial frames are extracted at a rate of 1 fps. |
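
The initial frame sampling rule quoted in the Experiment Setup row can be illustrated with a short sketch. This is not the authors' code: the function name `initial_frame_timestamps`, its parameters, and the choice to sample the midpoint of each equal-length segment for the uniform case are assumptions made for illustration. The quoted text only states that 16 frames are extracted uniformly for videos longer than 16 seconds and that shorter videos are sampled at 1 fps.

```python
import math
from typing import List


def initial_frame_timestamps(
    duration_s: float,
    num_uniform: int = 16,      # number of frames for long videos (from the quoted setup)
    threshold_s: float = 16.0,  # duration boundary in seconds (from the quoted setup)
) -> List[float]:
    """Return timestamps (in seconds) of the initial frames to extract.

    Hypothetical helper: the exact positions within the video are an
    assumption, only the frame counts follow the quoted setup.
    """
    if duration_s <= 0:
        return []
    if duration_s > threshold_s:
        # Long video: split into num_uniform equal segments and take
        # the midpoint of each one (assumed placement).
        step = duration_s / num_uniform
        return [step * (i + 0.5) for i in range(num_uniform)]
    # Short video: one frame per second, starting at t = 0 (1 fps).
    return [float(t) for t in range(math.ceil(duration_s))]
```

Under these assumptions, a 60-second video yields 16 timestamps spaced 3.75 seconds apart, while a 10-second video yields one timestamp per second from 0 to 9.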