SlowFocus: Enhancing Fine-grained Temporal Understanding in Video LLM

Authors: Ming Nie, Dan Ding, Chunwei Wang, Yuanfan Guo, Jianhua Han, Hang Xu, Li Zhang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments demonstrate the superiority of our mechanism across both existing public video understanding benchmarks and our proposed FineAction-CGR.
Researcher Affiliation | Collaboration | Ming Nie [1], Dan Ding [1], Chunwei Wang [2], Yuanfan Guo [2], Jianhua Han [2], Hang Xu [2], Li Zhang [1]; [1] School of Data Science, Fudan University; [2] Noah's Ark Lab, Huawei
Pseudocode | No | The paper describes algorithms and training strategies in narrative text and diagrams (e.g., Figure 2, Figure 3), but it does not include formal pseudocode blocks or sections labeled 'Algorithm'.
Open Source Code | Yes | https://github.com/fudan-zvg/SlowFocus
Open Datasets | Yes | Following the approach used in LLaMA-VID [16], we utilize the image-text LCS-558K dataset from LLaVA [20], along with 232K video-caption samples from the WebVid 2.5M [3]. In alignment with practices from VTimeLLM [10], we employ the InternVid-10M-FLT dataset [34], which is specifically designed for temporal-awareness training. In the final stage, to further enhance multi-modality comprehension and integration with the MMF mechanism, we construct an instruction-tuning dataset using samples from ActivityNet Captions [13] and FineAction [21].
Dataset Splits | No | The paper states 'we divide the FineAction dataset into training and testing sets based on videos, allocating 75% to the training set and 25% to the test set', but does not explicitly provide details for a validation split. (A minimal video-level split sketch is given after this table.)
Hardware Specification | Yes | All experiments are conducted on 8 V100 GPUs.
Software Dependencies | Yes | In our experiments, we implement LLaMA-VID [17] as baseline and utilize Vicuna-7B v1.5 [44] as our foundational LLM.
Experiment Setup | Yes | We adjust the resolution of input videos to 224 × 224, and each frame is condensed into 64 tokens. The low-frequency sampling interval M_L is set to match the original video fps, ensuring one frame is sampled every second. We define the dense sampling number N_H as 20 and the size of temporal token space N as 1000. The AdamW [22] optimizer is applied with cosine learning rate decay and a warm-up period. We train our Vid-LLM for three stages. During the initial pre-training stage, the learning rate is set to 1 × 10^-3. For the subsequent fine-tuning stages, the learning rate is adjusted to 2 × 10^-4. Additionally, the LoRA parameters are configured with r = 64 and alpha = 128. (A hedged configuration sketch follows this table.)
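
The hyperparameters quoted under Experiment Setup can be collected into a short configuration sketch. This is a minimal illustration assuming the standard PyTorch, PEFT, and Transformers APIs; the LoRA target modules, warm-up length, and total step count are placeholders that the review above does not specify.

```python
import torch
from torch import nn
from peft import LoraConfig
from transformers import get_cosine_schedule_with_warmup

# Hyperparameters quoted in the review (learning rate depends on the stage).
FRAME_RESOLUTION = 224          # input frames resized to 224 x 224
TOKENS_PER_FRAME = 64           # each frame condensed into 64 tokens
DENSE_SAMPLING_NUM_NH = 20      # N_H, dense sampling number
TEMPORAL_TOKEN_SPACE_N = 1000   # N, size of the temporal token space
LR_PRETRAIN = 1e-3              # stage-1 pre-training learning rate
LR_FINETUNE = 2e-4              # stage-2/3 fine-tuning learning rate

# LoRA with r = 64 and alpha = 128; target_modules are an assumption
# (typical attention projections), not stated in the review.
lora_cfg = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# AdamW with cosine decay and a warm-up period, shown on a dummy module;
# the warm-up and total step counts are placeholders.
model = nn.Linear(10, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=LR_PRETRAIN)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=10_000
)
```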
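
For the 75/25 video-level split noted under Dataset Splits, the following is a minimal sketch of how such a split could be reproduced. The ratio follows the paper; the random seed and shuffling strategy are assumptions, since neither is reported.

```python
import random

def split_videos(video_ids, train_ratio=0.75, seed=0):
    """Split a list of video IDs into train/test subsets at the video level.

    The 75/25 ratio follows the paper; the seed and shuffle are assumptions.
    """
    ids = sorted(video_ids)
    rng = random.Random(seed)
    rng.shuffle(ids)
    cut = int(len(ids) * train_ratio)
    return ids[:cut], ids[cut:]

# Usage with placeholder IDs:
train_ids, test_ids = split_videos([f"video_{i:05d}" for i in range(100)])
print(len(train_ids), len(test_ids))  # 75 25
```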