Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Seeing the Arrow of Time in Large Multimodal Models

Authors: Zihui (Sherry) Xue, Romy Luo, Kristen Grauman

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments show ArrowRL greatly advances temporal perception: it not only achieves substantial improvements on our challenging AoTBench but also boosts performance on standard video question answering (VQA) benchmarks (with peak accuracy gains reaching over 20% and 10% respectively).
Researcher Affiliation	Academia	Zihui Xue Mi Luo Kristen Grauman The University of Texas at Austin
Pseudocode	No	The paper describes the 'ArrowRL' framework with an overview diagram (Figure 5) and explains its components and objective function, but does not present a formal pseudocode or algorithm block.
Open Source Code	No	We are committed to reproducibility and will open-source the code, data and model upon acceptance.
Open Datasets	Yes	Our ArrowRL training data comprises a comprehensive suite of tasks, with data selected or adapted for each to emphasize scenarios requiring AoT awareness: (1) MCQ tasks: we format sequence direction classification (Fig. 1 (a)) as MCQ, using 1.1K training videos selected from UCF101 [58] (following prior use of this dataset [23]); (2) Open-ended QA: we curate a high-temporality subset of LLaVA-Video-178K [85], filtering based on the perplexity difference between forward and reverse video to retain samples with great temporal sensitivity, totaling 11.8K samples; (3) Video Captioning: we employ the training set [18] of RTime, which provides high-temporality videos alongside distinct human captions for their forward and reverse versions, comprising 11.7K samples.
Dataset Splits	Yes	For MCQ-based sequence direction classification, we use selected videos from UCF101 [58]. For video captioning, we leverage the RTime dataset [18] with a varied set of 16 prompts. For open-ended QA, we employ original questions from LLaVA-NeXT-178K [85]... For the first and second T2V task, we concatenate the forward video and its reversed version separated by a 2-second black frame into a single video input... For the AoT-sensitive VQA, we select the top 200 high-TDS questions from each source benchmark... This yields a subset of 1,800 VQA samples.
Hardware Specification	Yes	Training consists of 2000 RL steps on 6 NVIDIA GH200 GPUs.
Software Dependencies	No	The paper mentions 'lmms-eval' in Supp B.2 and 'Llama-3.1-70B-Instruct [24]' as an LLM judge, but does not provide specific version numbers for software dependencies or libraries used for implementation.
Experiment Setup	Yes	Hyperparameter α is set as 0.25 and γ is set as 0.75. The response group size G is set as 8. Training consists of 2000 RL steps on 6 NVIDIA GH200 GPUs. Our default input configuration is 16 frames for LLaVA-OV-7B, and 1 FPS (with a maximum of 16 frames) for Qwen2-VL-7B and Qwen2.5-VL-7B. Benchmark-specific adjustments include processing up to 32 frames (sampled at 1 FPS for Qwen models) for TVBench (due to video length) and reporting Vinoground at 4 FPS for Qwen models to align with [80] (further frame rate analysis in Fig. 11).
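
The Experiment Setup row above reports a handful of concrete settings (α, γ, group size G, RL steps, GPU count) and a frame-sampling rule of 1 FPS capped at 16 frames. The Python sketch below is only an illustration of how those reported values and that rule might be written down; it is not the authors' code. The names `ReportedSetup`, `uniform_indices`, and `fps_capped_indices` are hypothetical, and the choice to fall back to evenly spaced frames when the cap is reached is an assumption, since the page does not say how frames are chosen for longer videos.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    alpha: float = 0.25      # hyperparameter α
    gamma: float = 0.75      # hyperparameter γ
    group_size: int = 8      # response group size G
    rl_steps: int = 2000     # number of RL training steps
    num_gpus: int = 6        # NVIDIA GH200 GPUs
    max_frames: int = 16     # default frame cap (32 is reported for TVBench)


def uniform_indices(num_frames: int, count: int) -> list[int]:
    """Return up to `count` indices spread evenly over [0, num_frames - 1]."""
    if num_frames <= 0 or count <= 0:
        return []
    count = min(count, num_frames)
    if count == 1:
        return [0]
    return [round(i * (num_frames - 1) / (count - 1)) for i in range(count)]


def fps_capped_indices(num_frames: int, native_fps: float,
                       target_fps: float = 1.0, max_frames: int = 16) -> list[int]:
    """Sample at roughly `target_fps`, never returning more than `max_frames`.

    Assumption: once the cap applies, frames are spread evenly over the video.
    """
    if num_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        return []
    duration_s = num_frames / native_fps
    wanted = max(1, int(duration_s * target_fps))
    return uniform_indices(num_frames, min(wanted, max_frames))


if __name__ == "__main__":
    setup = ReportedSetup()
    # A 100-second clip at 30 FPS hits the cap, so max_frames indices come back.
    print(fps_capped_indices(3000, 30.0, max_frames=setup.max_frames))
```

Under these assumptions, a 10-second clip yields 10 frames at 1 FPS, while any clip longer than 16 seconds is limited to 16 frames (or 32 with the TVBench adjustment).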