Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning
Authors: Long Qian, Juncheng Li, Yu Wu, Yaobo Ye, Hao Fei, Tat-Seng Chua, Yueting Zhuang, Siliang Tang
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments with our proposed Momentor. The results indicate that our Momentor outperforms previous Video-LLMs in multiple tasks involving precise temporal position, such as temporal grounding, dense captioning, action segmentation, and highlight moment retrieval. (Abstract) and 5. Experiments section. |
| Researcher Affiliation | Academia | ¹Zhejiang University, ²Wuhan University, ³National University of Singapore. |
| Pseudocode | No | The paper describes methods like 'Event Boundary Detection' but does not present them in a structured pseudocode or algorithm block format. |
| Open Source Code | Yes | Our project is available at https://github.com/DCDmllm/Momentor. (Abstract) |
| Open Datasets | Yes | We conduct extensive experiments... using datasets such as Breakfast (Kuehne et al., 2014), 50 Salads (Stein & McKenna, 2013), ActivityNet Captions (Krishna et al., 2017), Charades-STA (Gao et al., 2017), and QVHighlights (Lei et al., 2021). ... We select a substantial number of videos from YTTemporal-1B (Zellers et al., 2022) to build Moment-10M. (Section 5.1 and 4.2) |
| Dataset Splits | No | The paper does not explicitly provide specific train/validation/test dataset splits with percentages, sample counts, or references to predefined splits. |
| Hardware Specification | Yes | We train Momentor on 8 A100 GPUs for around 60 hours. (Appendix B) |
| Software Dependencies | No | The paper mentions software components like CLIP, LLaMA, sentence transformer, PySceneDetect, and Grounding DINO, but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | We incorporate N = 300 temporal tokens for temporal positioning. For each video, we uniformly sample M = 300 frames for fine-grained reasoning. We freeze the frame encoder and LLM during training, while only the linear projection layer and TPM are updated. (Appendix B) |
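
The Experiment Setup row states that M = 300 frames are sampled uniformly from each video. The quoted text does not say how the frame indices are chosen, so the following is only a minimal sketch of one common reading: split the video into M equal-width segments and take the midpoint frame of each. The function name `uniform_frame_indices` and the midpoint rule are assumptions made for illustration, not details taken from the paper or its repository.

```python
def uniform_frame_indices(num_frames: int, num_samples: int = 300) -> list[int]:
    """Return num_samples frame indices spread evenly over num_frames frames.

    Assumption (not from the paper): each index is the midpoint of one of
    num_samples equal-width segments of the video.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    # Integer arithmetic for the midpoint (i + 0.5) * num_frames / num_samples;
    # the result is always in the range [0, num_frames - 1].
    return [((2 * i + 1) * num_frames) // (2 * num_samples)
            for i in range(num_samples)]


if __name__ == "__main__":
    # A hypothetical 3000-frame video sampled down to 300 frames.
    indices = uniform_frame_indices(3000)
    print(len(indices), indices[0], indices[-1])
```

If a video has fewer frames than the requested sample count, this sketch repeats indices rather than failing; whether Momentor pads, repeats, or handles short videos some other way is not stated in the quoted setup.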