Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning

Authors: Long Qian, Juncheng Li, Yu Wu, Yaobo Ye, Hao Fei, Tat-Seng Chua, Yueting Zhuang, Siliang Tang

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments with our proposed Momentor. The results indicate that our Momentor outperforms previous Video-LLMs in multiple tasks involving precise temporal position, such as temporal grounding, dense captioning, action segmentation, and highlight moment retrieval. (Abstract and Section 5, Experiments)
Researcher Affiliation | Academia | Zhejiang University, Wuhan University, National University of Singapore.
Pseudocode | No | The paper describes methods like 'Event Boundary Detection' but does not present them in a structured pseudocode or algorithm block format.
Open Source Code | Yes | Our project is available at https://github.com/DCDmllm/Momentor. (Abstract)
Open Datasets | Yes | We conduct extensive experiments... using datasets such as Breakfast (Kuehne et al., 2014), 50 Salads (Stein & McKenna, 2013), ActivityNet Captions (Krishna et al., 2017), Charades-STA (Gao et al., 2017), and QVHighlights (Lei et al., 2021). ... We select a substantial number of videos from YT-Temporal-1B (Zellers et al., 2022) to build Moment-10M. (Sections 5.1 and 4.2)
Dataset Splits | No | The paper does not explicitly provide specific train/validation/test dataset splits with percentages, sample counts, or references to predefined splits.
Hardware Specification | Yes | We train Momentor on 8 A100 GPUs for around 60 hours. (Appendix B)
Software Dependencies | No | The paper mentions software components such as CLIP, LLaMA, a sentence transformer, PySceneDetect, and Grounding DINO, but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | We incorporate N = 300 temporal tokens for temporal positioning. For each video, we uniformly sample M = 300 frames for fine-grained reasoning. We freeze the frame encoder and LLM during training, while only the linear projection layer and TPM are updated. (Appendix B)
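
For readers reconstructing the training configuration, the sketch below shows one plausible PyTorch wiring of the setup reported in Appendix B: M = 300 uniformly sampled frames per video, N = 300 temporal tokens, and a frozen frame encoder and LLM with only the linear projection layer and TPM left trainable. All names here (uniform_sample_frames, MomentorTrainingSketch, the embedding table standing in for the TPM) are illustrative assumptions, not the authors' released code.

import torch
import torch.nn as nn


def uniform_sample_frames(frames: torch.Tensor, m: int = 300) -> torch.Tensor:
    """Uniformly sample M frames from a (T, C, H, W) video tensor."""
    t = frames.shape[0]
    idx = torch.linspace(0, t - 1, steps=m).round().long()
    return frames[idx]


class MomentorTrainingSketch(nn.Module):
    """Hypothetical wiring: frozen frame encoder and LLM, trainable projection + temporal tokens."""

    def __init__(self, frame_encoder: nn.Module, llm: nn.Module,
                 vis_dim: int, llm_dim: int, num_temporal_tokens: int = 300):
        super().__init__()
        self.frame_encoder = frame_encoder                # frozen (e.g. a CLIP image encoder)
        self.llm = llm                                    # frozen (e.g. LLaMA)
        self.projection = nn.Linear(vis_dim, llm_dim)     # trainable linear projection layer
        self.temporal_tokens = nn.Embedding(num_temporal_tokens, llm_dim)  # N = 300, stands in for the TPM

        # Freeze the frame encoder and the LLM; only the projection and temporal tokens are updated.
        for module in (self.frame_encoder, self.llm):
            for p in module.parameters():
                p.requires_grad = False


def trainable_parameters(model: nn.Module):
    """Return only the unfrozen parameters (projection layer and temporal tokens)."""
    return [p for p in model.parameters() if p.requires_grad]

Under these assumptions, the optimizer would be built from trainable_parameters(model) only, which is what keeps the frame encoder and LLM weights fixed while the projection layer and TPM are trained.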