Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Universal Video Temporal Grounding with Generative Multi-modal Large Language Models

Authors: Zeqian Li, Shangzhe Di, Zhonghua Zhai, Weilin Huang, Yanfeng Wang, Weidi Xie

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Comprehensive experiments show that UniTime outperforms state-of-the-art approaches in both zero-shot and dataset-specific finetuned settings across five public temporal grounding benchmarks.
Researcher Affiliation	Collaboration	Zeqian Li (1), Shangzhe Di (1, 2), Zhonghua Zhai (2), Weilin Huang (2), Yanfeng Wang (1), Weidi Xie (1). (1) SAI, Shanghai Jiao Tong University; (2) ByteDance Seed
Pseudocode	No	The paper describes the methods through textual explanations and figures (e.g., Figure 2: Overview of the proposed UniTime framework), but does not include a dedicated pseudocode or algorithm block.
Open Source Code	Yes	https://lzq5.github.io/UniTime The code is released at https://github.com/Lzq5/UniTime.
Open Datasets	Yes	To support universal temporal grounding, we compile a diverse dataset spanning varied scenes, genres, durations, and query types (e.g., descriptions, questions, procedural instructions), as detailed in Table 1. For evaluation, we benchmark our method across three categories: (i) Short-video temporal grounding, including Charades-STA [51], ActivityNet-Captions [22], and QVHighlights [24]. (ii) Long-video temporal grounding, including Ego4D-NLQ [11] and TaCoS [47]. (iii) Video question answering (VideoQA), including two grounded benchmarks with temporally annotated queries (QaEgo4D [3], CG-Bench [5]) and two general benchmarks (MLVU [71], LongVideoBench [57]).
Dataset Splits	Yes	For evaluation, we benchmark our method across three categories: (i) Short-video temporal grounding, including Charades-STA [51], ActivityNet-Captions [22], and QVHighlights [24]. (ii) Long-video temporal grounding, including Ego4D-NLQ [11] and TaCoS [47]. (iii) Video question answering (VideoQA), including two grounded benchmarks with temporally annotated queries (QaEgo4D [3], CG-Bench [5]) and two general benchmarks (MLVU [71], LongVideoBench [57]). The statistics of evaluation benchmarks we used are listed in Table 2.
Hardware Specification	No	The paper mentions experimental settings like batch size and optimization, but does not specify any particular hardware (e.g., GPU models, CPU types) used for running the experiments.
Software Dependencies	No	Our framework is built on PyTorch, with Qwen2-VL-7B [56] as the base model. All experiments are conducted with a batch size of 8, using AdamW [32] optimizer with the learning rate 2e-4, and trained for one epoch with linear warmup during the first 3% of steps. The vision encoder is frozen, and the LLM is fine-tuned via LoRA [16] (rank = 8, alpha = 8).
Experiment Setup	Yes	Our framework is built on PyTorch, with Qwen2-VL-7B [56] as the base model. All experiments are conducted with a batch size of 8, using AdamW [32] optimizer with the learning rate 2e-4, and trained for one epoch with linear warmup during the first 3% of steps. The vision encoder is frozen, and the LLM is fine-tuned via LoRA [16] (rank = 8, alpha = 8). We sample frames at 2 fps, with N_f^short = 128 and N_f^long = 1024, capping input video at 16,384 tokens. The default segment length and replication factor are set to 32 and 4, respectively.
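
For readers who want the reported settings in one place, the following is a minimal sketch that collects the hyperparameters quoted in the Experiment Setup row into a plain Python configuration. It is not taken from the authors' code. The names `TrainConfig`, `warmup_lr`, and `frames_to_sample` are hypothetical, and two behaviours are assumptions because the quoted text does not state them: the learning rate is held constant after warmup, and the frame limits act as a simple cap on the number of frames sampled at 2 fps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as quoted in the Experiment Setup row (field names are hypothetical)."""
    base_model: str = "Qwen2-VL-7B"
    batch_size: int = 8
    learning_rate: float = 2e-4
    epochs: int = 1
    warmup_ratio: float = 0.03      # linear warmup over the first 3% of steps
    lora_rank: int = 8
    lora_alpha: int = 8
    fps: float = 2.0
    max_frames_short: int = 128     # N_f^short
    max_frames_long: int = 1024     # N_f^long
    max_video_tokens: int = 16_384
    segment_length: int = 32
    replication_factor: int = 4

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA scaling factor alpha / rank.
        return self.lora_alpha / self.lora_rank


def warmup_lr(step: int, total_steps: int, cfg: TrainConfig) -> float:
    """Learning rate at a 0-indexed optimizer step.

    Assumption: linear ramp during warmup, then constant. The source only
    says "linear warmup"; any decay afterwards is not described.
    """
    warmup_steps = max(1, round(total_steps * cfg.warmup_ratio))
    if step < warmup_steps:
        return cfg.learning_rate * (step + 1) / warmup_steps
    return cfg.learning_rate


def frames_to_sample(duration_s: float, max_frames: int, cfg: TrainConfig) -> int:
    """Assumption: sample at cfg.fps and cap the count at max_frames."""
    return min(int(duration_s * cfg.fps), max_frames)
```

Under these assumptions, a 30-second clip yields 60 frames against the short-video limit, while a one-hour video reaches the long-video cap of 1024 frames.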