Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Universal Video Temporal Grounding with Generative Multi-modal Large Language Models
Authors: Zeqian Li, Shangzhe Di, Zhonghua Zhai, Weilin Huang, Yanfeng Wang, Weidi Xie
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments show that UniTime outperforms state-of-the-art approaches in both zero-shot and dataset-specific finetuned settings across five public temporal grounding benchmarks. |
| Researcher Affiliation | Collaboration | Zeqian Li¹, Shangzhe Di¹,², Zhonghua Zhai², Weilin Huang², Yanfeng Wang¹, Weidi Xie¹; ¹SAI, Shanghai Jiao Tong University; ²ByteDance Seed |
| Pseudocode | No | The paper describes the methods through textual explanations and figures (e.g., Figure 2: Overview of the proposed UniTime framework), but does not include a dedicated pseudocode or algorithm block. |
| Open Source Code | Yes | https://lzq5.github.io/UniTime The code is released at https://github.com/Lzq5/UniTime. |
| Open Datasets | Yes | To support universal temporal grounding, we compile a diverse dataset spanning varied scenes, genres, durations, and query types (e.g., descriptions, questions, procedural instructions), as detailed in Table 1. For evaluation, we benchmark our method across three categories: (i) Short-video temporal grounding, including Charades-STA [51], ActivityNet-Captions [22], and QVHighlights [24]. (ii) Long-video temporal grounding, including Ego4D-NLQ [11] and TaCoS [47]. (iii) Video question answering (VideoQA), including two grounded benchmarks with temporally annotated queries (QaEgo4D [3], CG-Bench [5]) and two general benchmarks (MLVU [71], LongVideoBench [57]). |
| Dataset Splits | Yes | For evaluation, we benchmark our method across three categories: (i) Short-video temporal grounding, including Charades-STA [51], ActivityNet-Captions [22], and QVHighlights [24]. (ii) Long-video temporal grounding, including Ego4D-NLQ [11] and TaCoS [47]. (iii) Video question answering (VideoQA), including two grounded benchmarks with temporally annotated queries (QaEgo4D [3], CG-Bench [5]) and two general benchmarks (MLVU [71], LongVideoBench [57]). The statistics of evaluation benchmarks we used are listed in Table 2. |
| Hardware Specification | No | The paper mentions experimental settings like batch size and optimization, but does not specify any particular hardware (e.g., GPU models, CPU types) used for running the experiments. |
| Software Dependencies | No | Our framework is built on PyTorch, with Qwen2-VL-7B [56] as the base model. All experiments are conducted with a batch size of 8, using AdamW [32] optimizer with the learning rate 2e-4, and trained for one epoch with linear warmup during the first 3% of steps. The vision encoder is frozen, and the LLM is fine-tuned via LoRA [16] (rank = 8, alpha = 8). |
| Experiment Setup | Yes | Our framework is built on PyTorch, with Qwen2-VL-7B [56] as the base model. All experiments are conducted with a batch size of 8, using AdamW [32] optimizer with the learning rate 2e-4, and trained for one epoch with linear warmup during the first 3% of steps. The vision encoder is frozen, and the LLM is fine-tuned via LoRA [16] (rank = 8, alpha = 8). We sample frames at 2 fps, with $N_f^{\text{short}} = 128$ and $N_f^{\text{long}} = 1024$, capping input video at 16,384 tokens. The default segment length and replication factor are set to 32 and 4, respectively. |
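
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object. The following is a minimal sketch for readers attempting a reproduction, not the authors' code. The class name `TrainConfig` and the helpers `warmup_lr` and `num_sampled_frames` are hypothetical. Two behaviors are assumptions, because the quoted text does not specify them: the learning rate is held constant after warmup, and the frame counts 128 and 1024 act as hard caps on the number of sampled frames.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    batch_size: int = 8
    learning_rate: float = 2e-4
    num_epochs: int = 1
    warmup_fraction: float = 0.03   # linear warmup over the first 3% of steps
    lora_rank: int = 8
    lora_alpha: int = 8
    sample_fps: float = 2.0
    max_frames_short: int = 128     # assumed to be a frame cap for short videos
    max_frames_long: int = 1024     # assumed to be a frame cap for long videos
    max_video_tokens: int = 16_384
    segment_length: int = 32
    replication_factor: int = 4


def warmup_lr(step: int, total_steps: int, cfg: TrainConfig) -> float:
    """Learning rate at a zero-based step, with linear warmup.

    Assumption: the rate stays constant after warmup, since the source
    does not state the post-warmup schedule.
    """
    warmup_steps = max(1, int(total_steps * cfg.warmup_fraction))
    if step < warmup_steps:
        return cfg.learning_rate * (step + 1) / warmup_steps
    return cfg.learning_rate


def num_sampled_frames(duration_s: float, cfg: TrainConfig, long_video: bool) -> int:
    """Frames kept when sampling at cfg.sample_fps, under the assumed hard cap."""
    budget = cfg.max_frames_long if long_video else cfg.max_frames_short
    return min(int(duration_s * cfg.sample_fps), budget)
```

Under these assumptions, a 30-second clip sampled at 2 fps yields 60 frames, while a 10-minute clip treated as a short video is capped at 128 frames.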