Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

CausalVTG: Towards Robust Video Temporal Grounding via Causal Inference

Authors: Qiyi Wang, Senda Chen, Ying Shen

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on five widely-used benchmarks demonstrate that CausalVTG outperforms state-of-the-art methods, achieving higher localization precision under stricter IoU thresholds and more accurately identifying whether a query is truly grounded in the video. These results demonstrate both the effectiveness and generalizability of proposed CausalVTG.
Researcher Affiliation | Academia | Qiyi Wang Tongji University EMAIL Senda Chen Tongji University EMAIL Ying Shen Tongji University EMAIL
Pseudocode | No | The paper describes methods and models but does not present any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/MxLearner/CausalVTG.
Open Datasets | Yes | Experiments are conducted on five benchmarks: QVHighlights [4] annotated for both MR and HD, serving as a comprehensive benchmark for multi-task evaluation; Charades-STA [3] and ActivityNet-Caption [46] annotated with precise temporal segments for MR; and Charades-RF and ActivityNet-RF [27] extend their original datasets by introducing false-query scenarios for evaluating whether the query is grounded.
Dataset Splits | Yes | ActivityNet-Captions [46] is built upon ActivityNet v1.3 and contains over 20,000 untrimmed videos with 100,000 sentence-level annotations. Each video averages 120 seconds in length and includes multiple annotated events. The dataset is split into 10,024 training, 4,926 validation, and 5,044 test videos.
Hardware Specification | Yes | All models are implemented using PyTorch and trained for 50 epochs on a single NVIDIA RTX 4070 SUPER GPU (12GB VRAM, 32GB RAM). ... We evaluate computational overhead on the QVHighlights dataset using an NVIDIA A800 GPU (80GB) with batch size 64 for 50 epochs.
Software Dependencies | No | All models are implemented using PyTorch... InternVideo2-CLIP [47] serves as the unified backbone for both video and text encoding... The paper mentions software like PyTorch and InternVideo2-CLIP, but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | All models are implemented using PyTorch and trained for 50 epochs on a single NVIDIA RTX 4070 SUPER GPU (12GB VRAM, 32GB RAM). InternVideo2-CLIP [47] serves as the unified backbone for both video and text encoding, with all hidden dimensions set to 256. The temporal stride set for the multi-scale feature pyramid is S = {1, 2, 4, 8} for standard-length videos, and extended to {1, 2, 4, 8, 16} for longer videos in ActivityNet-Caption and ActivityNet-RF. The training objective is a weighted combination of four losses, with coefficients λ_hd = 0.1, λ_mr = 0.2, λ_fg = 1.0, and λ_qr = 0.1.
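
As an illustration of the weighted objective quoted in the Experiment Setup row, the following minimal Python sketch combines four loss values using the reported coefficients. Only the four coefficient values come from the quoted text. The function name `total_loss`, the dictionary keys, and the example loss values are hypothetical and are not taken from the authors' code, and the sketch assumes the combination is a plain weighted sum.

```python
# Coefficients as reported in the Experiment Setup row (hd, mr, fg, qr).
# The dictionary layout and key names are assumptions made for illustration.
LOSS_WEIGHTS = {"hd": 0.1, "mr": 0.2, "fg": 1.0, "qr": 0.1}


def total_loss(losses, weights=LOSS_WEIGHTS):
    """Return the weighted sum of the per-term loss values.

    `losses` maps each term name to a number (or any object that supports
    multiplication by a float and addition, such as a tensor).
    """
    missing = set(weights) - set(losses)
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(weights[name] * losses[name] for name in weights)


# Hypothetical per-term loss values, chosen only to show the arithmetic.
example = {"hd": 0.5, "mr": 1.0, "fg": 0.25, "qr": 2.0}
print(total_loss(example))
```

Keeping the coefficients in a single mapping makes it straightforward to check a training configuration against the values reported in the paper.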