Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning
Authors: Long Qian, Juncheng Li, Yu Wu, Yaobo Ye, Hao Fei, Tat-Seng Chua, Yueting Zhuang, Siliang Tang
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments with our proposed Momentor. The results indicate that our Momentor outperforms previous Video-LLMs in multiple tasks involving precise temporal position, such as temporal grounding, dense captioning, action segmentation, and highlight moment retrieval. (Abstract; see also Section 5, Experiments.) |
| Researcher Affiliation | Academia | Zhejiang University, Wuhan University, National University of Singapore. |
| Pseudocode | No | The paper describes methods like 'Event Boundary Detection' but does not present them in a structured pseudocode or algorithm block format. |
| Open Source Code | Yes | Our project is available at https://github.com/DCDmllm/Momentor. (Abstract) |
| Open Datasets | Yes | We conduct extensive experiments... using datasets such as Breakfast (Kuehne et al., 2014), 50 Salads (Stein & McKenna, 2013), ActivityNet Captions (Krishna et al., 2017), Charades-STA (Gao et al., 2017), and QVHighlights (Lei et al., 2021). ... We select a substantial number of videos from YTTemporal-1B (Zellers et al., 2022) to build Moment-10M. (Sections 4.2 and 5.1) |
| Dataset Splits | No | The paper does not explicitly provide specific train/validation/test dataset splits with percentages, sample counts, or references to predefined splits. |
| Hardware Specification | Yes | We train Momentor on 8 A100 GPUs for around 60 hours. (Appendix B) |
| Software Dependencies | No | The paper mentions software components like CLIP, LLaMA, sentence transformer, PySceneDetect, and Grounding DINO, but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | We incorporate N = 300 temporal tokens for temporal positioning. For each video, we uniformly sample M = 300 frames for fine-grained reasoning. We freeze the frame encoder and LLM during training, while only the linear projection layer and TPM are updated. (Appendix B) See the configuration sketch below the table. |
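
As a reading aid, here is a minimal PyTorch-style sketch of the training configuration quoted in the Experiment Setup row (N = 300 temporal tokens, M = 300 uniformly sampled frames, frozen frame encoder and LLM, trainable projection layer and TPM). The module names (`frame_encoder`, `llm`, `projection`, `tpm`) and the timestamp-to-token mapping are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

N_TEMPORAL_TOKENS = 300   # temporal tokens for temporal positioning (Appendix B)
M_FRAMES = 300            # frames uniformly sampled per video (Appendix B)


def sample_frame_indices(num_video_frames: int, m: int = M_FRAMES) -> torch.Tensor:
    """Uniformly sample m frame indices from a video with num_video_frames frames."""
    return torch.linspace(0, num_video_frames - 1, steps=m).long()


def timestamp_to_temporal_token(t: float, duration: float,
                                n: int = N_TEMPORAL_TOKENS) -> int:
    """Map a timestamp (seconds) to one of n discrete temporal token ids.
    This quantization rule is an assumption for illustration only."""
    return min(int(t / duration * n), n - 1)


def configure_trainable_params(frame_encoder: nn.Module, llm: nn.Module,
                               projection: nn.Module, tpm: nn.Module) -> list:
    """Freeze the frame encoder and LLM; train only the projection layer and TPM."""
    for module in (frame_encoder, llm):
        for p in module.parameters():
            p.requires_grad = False
    trainable = []
    for module in (projection, tpm):
        for p in module.parameters():
            p.requires_grad = True
        trainable += list(module.parameters())
    return trainable


# Example usage with hypothetical module instances:
# optimizer = torch.optim.AdamW(
#     configure_trainable_params(frame_encoder, llm, projection, tpm), lr=2e-5)
```

Beyond the settings quoted above and the hardware note (8 A100 GPUs, ~60 hours), the remaining hyperparameters are not specified in the excerpts here, so the optimizer and learning rate in the usage example are placeholders.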