Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning
Authors: Long Qian, Juncheng Li, Yu Wu, Yaobo Ye, Hao Fei, Tat-Seng Chua, Yueting Zhuang, Siliang Tang
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments with our proposed Momentor. The results indicate that our Momentor outperforms previous Video-LLMs in multiple tasks involving precise temporal position, such as temporal grounding, dense captioning, action segmentation, and highlight moment retrieval. (Abstract; see also Section 5, Experiments.) |
| Researcher Affiliation | Academia | Zhejiang University, Wuhan University, National University of Singapore. |
| Pseudocode | No | The paper describes methods like 'Event Boundary Detection' but does not present them in a structured pseudocode or algorithm block format. |
| Open Source Code | Yes | Our project is available at https://github.com/DCDmllm/Momentor. (Abstract) |
| Open Datasets | Yes | We conduct extensive experiments... using datasets such as Breakfast (Kuehne et al., 2014), 50 Salads (Stein & McKenna, 2013), ActivityNet Captions (Krishna et al., 2017), Charades-STA (Gao et al., 2017), and QVHighlights (Lei et al., 2021). ... We select a substantial number of videos from YTTemporal-1B (Zellers et al., 2022) to build Moment-10M. (Sections 4.2 and 5.1) |
| Dataset Splits | No | The paper does not explicitly provide specific train/validation/test dataset splits with percentages, sample counts, or references to predefined splits. |
| Hardware Specification | Yes | We train Momentor on 8 A100 GPUs for around 60 hours. (Appendix B) |
| Software Dependencies | No | The paper mentions software components like CLIP, LLaMA, sentence transformer, PySceneDetect, and Grounding DINO, but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | We incorporate N = 300 temporal tokens for temporal positioning. For each video, we uniformly sample M = 300 frames for fine-grained reasoning. We freeze the frame encoder and LLM during training, while only the linear projection layer and TPM are updated. (Appendix B) See the configuration sketch below the table. |
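
As a reading aid, here is a minimal PyTorch-style sketch of the training configuration quoted in the Experiment Setup row (N = 300 temporal tokens, M = 300 uniformly sampled frames, frozen frame encoder and LLM, trainable projection layer and TPM). The module names (`frame_encoder`, `llm`, `projection`, `tpm`) and the timestamp-to-token mapping are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

N_TEMPORAL_TOKENS = 300   # temporal tokens for temporal positioning (Appendix B)
M_FRAMES = 300            # frames uniformly sampled per video (Appendix B)


def sample_frame_indices(num_video_frames: int, m: int = M_FRAMES) -> torch.Tensor:
    """Uniformly sample m frame indices from a video with num_video_frames frames."""
    return torch.linspace(0, num_video_frames - 1, steps=m).long()


def timestamp_to_temporal_token(t: float, duration: float,
                                n: int = N_TEMPORAL_TOKENS) -> int:
    """Map a timestamp (seconds) to one of n discrete temporal token ids.
    This quantization rule is an assumption for illustration only."""
    return min(int(t / duration * n), n - 1)


def configure_trainable_params(frame_encoder: nn.Module, llm: nn.Module,
                               projection: nn.Module, tpm: nn.Module) -> list:
    """Freeze the frame encoder and LLM; train only the projection layer and TPM."""
    for module in (frame_encoder, llm):
        for p in module.parameters():
            p.requires_grad = False
    trainable = []
    for module in (projection, tpm):
        for p in module.parameters():
            p.requires_grad = True
        trainable += list(module.parameters())
    return trainable


# Example usage with hypothetical module instances:
# optimizer = torch.optim.AdamW(
#     configure_trainable_params(frame_encoder, llm, projection, tpm), lr=2e-5)
```

Beyond the settings quoted above and the hardware note (8 A100 GPUs, ~60 hours), the remaining hyperparameters are not specified in the excerpts here, so the optimizer and learning rate in the usage example are placeholders.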