Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames

Authors: Anurag Arnab, Ahmet Iscen, Mathilde Caron, Alireza Fathi, Cordelia Schmid

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate how leveraging more computation at inference-time to select the most relevant context leads to improvements in accuracy, in agreement with recent work on inference-time scaling of LLMs. Moreover, we achieve state-of-the-art results on 4 diverse video question-answering datasets, showing consistent improvements with 3 different VLMs.
Researcher Affiliation | Industry | Anurag Arnab, Ahmet Iscen, Mathilde Caron, Alireza Fathi, Cordelia Schmid. Google DeepMind
Pseudocode | No | The paper describes its methodology in Section 3 and uses structured prompts (Figure 3), but it does not contain explicit pseudocode or algorithm blocks labeled as such, nor does it present the steps of an algorithm in a code-like format beyond the prompt structure.
Open Source Code | No | We are not, however, releasing our experimental code.
Open Datasets | Yes | We use the following long-video QA datasets: Egoschema [41], LVBench [58], OpenEQA [40], NExT-QA [66]. We have used publicly-available academic datasets (Egoschema, Next-QA, LVBench and Open-EQA) for all of our experiments. Licenses: Next-QA [66]: MIT; LVBench [58]: CC BY-SA 4.0; OpenEQA [40]: MIT; Egoschema [41]: Unknown.
Dataset Splits | No | The paper states: "Egoschema [41]... We run ablations on the subset of 500 labelled examples, and also report results on the full set of 5000 videos via the evaluation server." While it mentions specific subsets, it does not explicitly provide details on training, validation, and test splits (e.g., percentages, exact counts, or specific predefined split references) for any of the datasets used to allow for reproduction of the data partitioning.
Hardware Specification | Yes | Qwen-2.5-VL [6] has publicly-available weights, and we run the Hugging Face implementation using a server with 8x NVIDIA A100 GPUs.
Software Dependencies | No | The paper mentions using specific models (Gemini 1.5 Flash, GPT-4o-mini, Qwen-2.5-VL) and the Hugging Face implementation for Qwen-2.5-VL, but does not provide specific version numbers for ancillary software dependencies such as Python, PyTorch, or CUDA versions.
Experiment Setup | Yes | We use Gemini 1.5 Flash as our primary VLM, specifically the Gemini-1.5-flash-002 checkpoint via the Vertex API [18]. ... Gemini uses 258 tokens per frame, and unless otherwise specified, we use a context budget of 32K tokens. This corresponds to 120 frames... For all datasets, we sample videos at 1 frame per second (fps)... we use self-consistency [59], where we sample multiple predictions from the VLM and take the majority vote as the final answer... increasing the sampling temperature to 0.7 [59]... We used s = 64 frames, and l = 12 segments for our experiment, ablating this choice in App. A. Table 6: Effect of hyperparameters. We analyse the effect of the segment size, s (a), and the number of uniform context frames, u (b). The context-limit is k = 120, meaning that the remaining m = k - u frames are selected by the model.
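
The arithmetic quoted in the Experiment Setup entry can be checked with a short sketch. This is not the authors' code, which is not released. The function names, the example value u = 20, and the tie-breaking rule in the vote are assumptions made here for illustration only. The figures 258 tokens per frame, k = 120, and the relation m = k - u come from the quoted text.

```python
from collections import Counter

TOKENS_PER_FRAME = 258  # reported for Gemini in the quoted setup
CONTEXT_FRAMES_K = 120  # context limit k from the quoted Table 6 caption


def frames_to_tokens(num_frames, tokens_per_frame=TOKENS_PER_FRAME):
    """Visual tokens consumed by a given number of frames."""
    return num_frames * tokens_per_frame


def model_selected_frames(k, u):
    """Frames left for the model to select: m = k - u."""
    if not 0 <= u <= k:
        raise ValueError("u must satisfy 0 <= u <= k")
    return k - u


def majority_vote(predictions):
    """Self-consistency: return the most frequent sampled answer.

    Ties go to the answer seen first; that rule is an assumption of this
    sketch and is not stated in the quoted text.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    return Counter(predictions).most_common(1)[0][0]


if __name__ == "__main__":
    print(frames_to_tokens(CONTEXT_FRAMES_K))           # tokens used by 120 frames
    print(model_selected_frames(CONTEXT_FRAMES_K, 20))  # u = 20 is hypothetical
    print(majority_vote(["B", "A", "B", "C"]))          # hypothetical sampled answers
```

Under these numbers, 120 frames occupy 120 x 258 = 30,960 tokens, which fits inside a 32K budget whether "32K" is read as 32,000 or 32,768.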