Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

NarrativeBridge: Enhancing Video Captioning with Causal-Temporal Narrative

Authors: Asmar Nadeem, Faegheh Sardari, Robert Dawes, Syed Husain, Adrian Hilton, Armin Mustafa

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments demonstrate that CEN significantly outperforms state-of-the-art models in articulating the causal and temporal aspects of video content: 17.88 and 17.44 CIDEr on the MSVD-CTN and MSRVTT-CTN datasets, respectively. Cross-dataset evaluations further showcase CEN's strong generalization capabilities.
Researcher Affiliation Collaboration Asmar Nadeem^1, Faegheh Sardari^1, Robert Dawes^2, Syed Sameed Husain^1, Adrian Hilton^1, Armin Mustafa^1. ^1CVSSP, University of Surrey, Guildford, UK; ^2BBC Research and Development, UK
Pseudocode No The paper describes the method using block diagrams (Figure 3) and mathematical equations (e.g., equations 1-9), but does not present any structured pseudocode or algorithm blocks.
Open Source Code Yes The CTN captions benchmark dataset provides a comprehensive testbed for evaluating video understanding models' ability to grasp complex temporal and causal dynamics, and will be released for research purposes at https://narrativebridge.github.io/. CEN explicitly models cause-effect relationships and temporal dynamics to generate rich, contextually relevant descriptions capturing the nuanced causal-temporal narrative in videos, demonstrating significant performance improvement over SOTA methods (Section 4). NarrativeBridge lays the foundation for a paradigm shift, where models comprehend the underlying causal-temporal narrative driving events, unlocking new frontiers in contextually aware human-machine interaction with video. For future work, we aim to integrate CTN caption generation with existing image captioning techniques to annotate unlabelled videos with causal-temporal narrative labels. A few frames of the video will be labelled using off-the-shelf image captioning methods, and CTN caption generation will exploit the labelled frames to generate one coherent caption for the unlabelled video (see Appendix A.8). This synergistic approach opens new avenues for comprehensive video understanding and annotation, enabling more robust and accurate video analysis pipelines.
Open Datasets Yes Our Causal-Temporal Narrative (CTN), a novel captions benchmark dataset, leverages a large language model (LLM) and few-shot prompting to generate enhanced video descriptions that explicitly encode causal and temporal sequences, as shown in Figure 1. This establishes a clear connection between the cause (reckless driving) and the effect (damaged car and subsequent behavior of the group). Our CTN captions benchmark dataset enables models to better understand and articulate the causality, sequence, and significance of events within the broader video context Wilkens et al. (2003).
Dataset Splits Yes For MSVD, we use 1200 videos for training, 100 for validation, and 670 for testing. For MSRVTT, we use 6513 videos for training, 497 for validation, and 2990 for testing.
Hardware Specification Yes We generate the CTN captions using the Mixtral of Experts LLM Jiang et al. (2024), running on A100-80GB GPUs. All the experiments in Stage 1 and Stage 2 of our CEN are run using A100-80GB and RTX 3090-24GB GPUs respectively. We implement SEM-POS Nadeem et al. (2023), AKGNN Hendria et al. (2023) and GIT Wang et al. (2022) using RTX 3090-24GB, A100-80GB and A100-80GB GPUs respectively. For comparison with Vision-Language Models (VLMs), we implement two fine-tuning approaches: LoRA Fine-Tuning and Simple Fine-Tuning using A100-80GB GPUs.
Software Dependencies No The paper mentions specific models like Mixtral of Experts, CLIP-ViT, UniVL, and optimizers like Adam and AdamW, but does not provide specific version numbers for programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or other key libraries used in the implementation.
Experiment Setup Yes Our CEN model is trained using the Adam optimizer with learning rates of 1×10⁻⁴ (stage 1) and 1×10⁻⁶ (stage 2) and a batch size of 64 for 10 epochs (stage 1) and 50 epochs (stage 2). For comparison with recent Vision-Language Models (VLMs), we fine-tune Video-LLaVA Lin et al. (2023) and ShareGPT4Video Chen et al. (2024) using both LoRA and simple fine-tuning approaches on our CTN benchmark dataset. We use the recommended hyperparameters for each model during fine-tuning. Further details are provided in Appendix A.1. LoRA Fine-Tuning is applied specifically to the LLM component, with a learning rate of 2e-4 for the LoRA parameters. Simple Fine-Tuning is applied to the entire model, using an AdamW (Loshchilov, 2017) optimizer with a cosine learning rate schedule (initial learning rate: 1e-3, warmup ratio: 0.03).
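The reported two-stage training configuration and the MSVD split sizes can be gathered into a minimal sketch. This is an illustrative reconstruction, not code from the paper: the dictionary layout, variable names, and the `optimizer_steps` helper (which assumes one full pass over the training split per epoch) are assumptions.

```python
import math

# Two-stage CEN training hyperparameters as reported
# (Adam optimizer, batch size 64 in both stages).
STAGES = {
    "stage1": {"lr": 1e-4, "epochs": 10, "batch_size": 64},
    "stage2": {"lr": 1e-6, "epochs": 50, "batch_size": 64},
}

# MSVD split sizes as reported: 1200 train / 100 val / 670 test.
MSVD_TRAIN_VIDEOS = 1200

def optimizer_steps(num_examples: int, stage: dict) -> int:
    """Total optimizer updates for one stage, assuming one pass per epoch."""
    steps_per_epoch = math.ceil(num_examples / stage["batch_size"])
    return steps_per_epoch * stage["epochs"]

# Stage 1 on MSVD: ceil(1200 / 64) = 19 steps/epoch, times 10 epochs.
print(optimizer_steps(MSVD_TRAIN_VIDEOS, STAGES["stage1"]))  # → 190
```

Such a rough step count is only a sanity check on the reported schedule; the actual training loop, loss, and data pipeline are described in the paper's Section 4 and Appendix A.1.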