VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation

Authors: Shiwei Wu, Joya Chen, Kevin Qinghong Lin, Qimeng Wang, Yan Gao, Qianli Xu, Tong Xu, Yao Hu, Enhong Chen, Mike Zheng Shou

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments to demonstrate the effectiveness of VideoLLM-MoD, showing its state-of-the-art results on multiple benchmarks, including narration, forecasting, and summarization tasks in COIN, Ego4D, and Ego-Exo4D datasets.
Researcher Affiliation | Collaboration | (1) University of Science and Technology of China & State Key Laboratory of Cognitive Intelligence; (2) Show Lab, National University of Singapore; (3) Xiaohongshu Inc.; (4) Institute for Infocomm Research, A*STAR
Pseudocode | No | The paper includes a model architecture diagram (Figure 6) and mathematical formulas for FLOPs, but no structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code and checkpoints will be made available at github.com/showlab/VideoLLM-online.
Open Datasets | Yes | We validate the effectiveness of our proposed VideoLLM-MoD in both online and offline settings, including the egocentric video datasets Ego4D [27] and Ego-Exo4D [28], as well as the instructional video dataset COIN [71].
Dataset Splits | Yes | Ego4D Narration Stream Benchmark: Following VideoLLM-online [9], we utilize the dense Ego4D timestamp-narrations to create a streaming set, aiming to generate timely narrations similar to those produced by Ego4D human annotators [27]. ... We use the standard Ego4D v2 splits...
Hardware Specification | Yes | We trained all models on 8 NVIDIA A100 GPUs. ... Calculated via DeepSpeed FLOPs Profiler, processing 600 frames with (1+3×3) patches on a single NVIDIA A100 GPU. (A profiling sketch is given after the table.)
Software Dependencies | No | The paper names specific components such as 'SigLIP-L/16', 'Meta-Llama-3-8B-Instruct', and 'LoRA', but does not provide version numbers for underlying software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | For a fair comparison, we trained the models on the Ego4D narration benchmark for 2 epochs with a learning rate of 2×10^-4. For the Ego4D LTA benchmark, Ego-Exo4D fine-grained keystep recognition benchmark, and COIN benchmark, we trained the models for 6, 10, and 5 epochs with learning rates of 3×10^-4, 2×10^-4, and 1×10^-4, respectively. During training, we set the batch size to 64 and the streaming loss weight σ to 1.0 by default. For the trade-off between computation cost and performance, we insert LayerExpert every other layer and set the keep ratio r to 0.2 as the default setting. (Configuration and keep-ratio sketches are given after the table.)
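
The Hardware Specification row states that compute cost was measured with the DeepSpeed FLOPs Profiler while processing 600 frames of (1+3×3) patches. Below is a minimal sketch of how such a measurement can be set up with DeepSpeed's get_model_profile helper; the model handle, the inputs_embeds forward signature, and the hidden size of 4096 are illustrative assumptions, not the authors' actual profiling code.

```python
# Minimal sketch of a FLOPs measurement with the DeepSpeed FLOPs Profiler.
# The model handle, the `inputs_embeds` forward signature, and the hidden
# size are assumptions for illustration only.
import torch
from deepspeed.profiling.flops_profiler import get_model_profile

def profile_vision_flops(model, num_frames=600, tokens_per_frame=1 + 3 * 3,
                         hidden_dim=4096, device="cuda"):
    # Simulate 600 frames, each contributing (1 + 3x3) visual tokens that are
    # already projected into the LLM embedding space.
    dummy_embeds = torch.randn(1, num_frames * tokens_per_frame, hidden_dim,
                               device=device, dtype=torch.float16)
    flops, macs, params = get_model_profile(
        model=model,
        kwargs={"inputs_embeds": dummy_embeds},  # assumes an HF-style forward
        print_profile=True,
        as_string=True,
    )
    return flops, macs, params
```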
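
For reference, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch. The key names and grouping below are illustrative, not the authors' actual configuration schema.

```python
# Training hyperparameters as quoted in the Experiment Setup row.
PER_BENCHMARK = {
    "ego4d_narration_stream": {"epochs": 2,  "lr": 2e-4},
    "ego4d_lta":              {"epochs": 6,  "lr": 3e-4},
    "egoexo4d_keystep":       {"epochs": 10, "lr": 2e-4},
    "coin":                   {"epochs": 5,  "lr": 1e-4},
}

COMMON = {
    "batch_size": 64,
    "streaming_loss_weight": 1.0,       # sigma in the quoted setup
    "layer_expert_every_n_layers": 2,   # LayerExpert inserted every other layer
    "vision_token_keep_ratio": 0.2,     # keep ratio r
    "gpus": "8x NVIDIA A100",
}
```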
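
The keep ratio r = 0.2 in the Experiment Setup row controls how many vision tokens receive full computation at the LayerExpert layers. The snippet below is a generic mixture-of-depths style top-r selection sketch, assuming a learned linear router; it illustrates what a keep ratio means but is not the authors' LayerExpert implementation.

```python
# Generic sketch of mixture-of-depths style vision-token selection with a
# keep ratio r: only the top-scoring fraction of vision tokens is routed
# through the layer's full computation, the rest pass through unchanged.
import torch
import torch.nn as nn

class TopRVisionRouter(nn.Module):
    def __init__(self, hidden_dim: int, keep_ratio: float = 0.2):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)  # scalar importance per vision token
        self.keep_ratio = keep_ratio

    def forward(self, vision_tokens: torch.Tensor, layer: nn.Module) -> torch.Tensor:
        # vision_tokens: (batch, num_tokens, hidden_dim)
        b, n, d = vision_tokens.shape
        k = max(1, int(n * self.keep_ratio))
        scores = self.score(vision_tokens).squeeze(-1)           # (b, n)
        topk = scores.topk(k, dim=-1).indices                    # tokens kept for full compute
        index = topk.unsqueeze(-1).expand(-1, -1, d)
        selected = torch.gather(vision_tokens, 1, index)         # (b, k, d)
        processed = layer(selected)                              # full computation on kept tokens only
        out = vision_tokens.clone()                              # skipped tokens are passed through
        out.scatter_(1, index, processed)
        return out
```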