Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Understanding Long Videos with Multimodal Language Models

Authors: Kanchana Ranasinghe, Xiang Li, Kumara Kahatapitiya, Michael Ryoo

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this work, we systematically study this question in the context of video question-answering (QnA) benchmarks, building two modality-constrained baselines to highlight our findings. These two frameworks are tagged Just-LLM and Single-Frame-VLM. We discover how these models perform significantly better than random prediction on multiple long-video understanding benchmarks (see Table 1, similar findings in Min et al. (2024))."
Researcher Affiliation | Academia | "Kanchana Ranasinghe, Xiang Li, Kumara Kahatapitiya & Michael S. Ryoo EMAIL"
Pseudocode | Yes | "We first describe our exact templates as Python pseudo-code in Table A.1. Global Object Information (x_GOI): 'Consider following objects in video to answer the question:' + ', '.join(GOI_data) + '. ' + task_question"
Open Source Code | Yes | "Code: github.com/kahnchana/mvu"
Open Datasets | Yes | "We evaluate on two video question answering datasets focused on long-form videos: EgoSchema (Mangalam et al., 2023) and NExT-QA (Xiao et al., 2021). We also evaluate using a series of robotics datasets from the Open X-Embodiment robotics dataset (Open-X-Embodiment-Collaboration et al., 2023) to test our model generality (more details in Section 5.2)."
Dataset Splits | Yes | "EgoSchema is a long-form egocentric video question-answering benchmark, consisting of a 500-video public subset (EgoSchema-S) and a full 5000+ video evaluation set (EgoSchema-F) accessed only through evaluation servers. This dataset spans over 250 hours and is specially constructed to ensure that questions require awareness of a longer temporal window for correctly answering (Mangalam et al., 2023). NExT-QA similarly contains long-form videos with a focus on requiring causal & temporal action reasoning as well as common scene comprehension for correctly answering. It contains a validation set (NExT-QA-V) of 4996 video-question pairs and a test set (NExT-QA-T) of 8564 video-question pairs."
Hardware Specification | Yes | "For our evaluations, we directly use these models, utilizing two NVIDIA RTX A5000 24GB GPUs for inference."
Software Dependencies | No | The paper mentions several models used (e.g., LLaVA-v1.5-13B, OWL-ViT-B/32, Llama-2-7b-Chat, Gemma-7b-IT, Mistral-7B-Instruct) and states they are hosted on Hugging Face, but specific version numbers for general software dependencies such as Python, PyTorch, or CUDA are not provided.
Experiment Setup | Yes | "Our proposed MVU framework and its variants use off-the-shelf models trained on images, thus requiring no re-training of these models. For our evaluations, we directly use these models, utilizing two NVIDIA RTX A5000 24GB GPUs for inference. We evaluate on two video question answering datasets focused on long-form videos: EgoSchema (Mangalam et al., 2023) and NExT-QA (Xiao et al., 2021). We also evaluate using a series of robotics datasets from the Open X-Embodiment robotics dataset (Open-X-Embodiment-Collaboration et al., 2023) to test our model generality (more details in Section 5.2). We use LLaVA-v1.5-13B (Liu et al., 2023a) for likelihood selection and frame object list generation. For object localization, we use OWL-ViT-B/32 (Minderer et al., 2022). Unless explicitly specified, we use the above setup in all our experiments. Variants of our framework use the LLMs Llama-2-7b-Chat, Gemma-7b-IT, and Mistral-7B-Instruct (default) for likelihood selection." Appendices A and F describe the prompt templates and likelihood-selection implementation details, which are key parts of the experimental setup.
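The Pseudocode row quotes the paper's Global Object Information (x_GOI) prompt template. A minimal runnable sketch of that string construction follows; the function and variable names (`build_goi_prompt`, `goi_data`, `task_question`) are illustrative, not taken from the paper's code, and the spacing reproduces the quoted template verbatim:

```python
# Sketch of the x_GOI template quoted above. Spacing follows the quoted
# pseudo-code exactly (no space after the colon); the real repository may differ.
def build_goi_prompt(goi_data, task_question):
    """Prepend the detected global object list to the task question."""
    return (
        "Consider following objects in video to answer the question:"
        + ", ".join(goi_data)
        + ". "
        + task_question
    )

prompt = build_goi_prompt(["person", "frying pan", "egg"],
                          "What is the person cooking?")
```

The assembled string is then passed to the LLM as the question prompt, so the object list acts as lightweight global video context.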
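The experiment setup repeatedly mentions likelihood selection (detailed in the paper's Appendix F): each multiple-choice answer is scored by its likelihood under the language model, and the highest-scoring option is chosen. A hedged, model-agnostic sketch of that idea is below; the `log_prob_fn` hook and the toy word-overlap scorer are illustrative stand-ins, not the paper's implementation, which uses LLMs such as Mistral-7B-Instruct:

```python
import math

def likelihood_select(prompt, candidates, log_prob_fn):
    """Return the candidate answer the LM finds most likely given the prompt.

    log_prob_fn(prompt, candidate) -> total log-probability of the candidate's
    tokens conditioned on the prompt. (Length-normalized variants are also
    common; the paper's exact scoring is described in its Appendix F.)
    """
    scores = {c: log_prob_fn(prompt, c) for c in candidates}
    return max(scores, key=scores.get)

# Toy scorer for illustration only: rewards answers sharing words with the
# prompt. A real setup would query an LLM for token log-probabilities instead.
def toy_log_prob(prompt, candidate):
    overlap = len(set(prompt.lower().split()) & set(candidate.lower().split()))
    return math.log(1 + overlap)

best = likelihood_select(
    "The person cracks an egg into the frying pan.",
    ["The person is cooking an egg.", "The person is reading a book."],
    toy_log_prob,
)
```

Because selection reduces to an argmax over per-candidate scores, the same code works unchanged whichever LM (Llama-2-7b-Chat, Gemma-7b-IT, or Mistral-7B-Instruct) supplies `log_prob_fn`.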