VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation
Authors: Shiwei Wu, Joya Chen, Kevin Qinghong Lin, Qimeng Wang, Yan Gao, Qianli Xu, Tong Xu, Yao Hu, Enhong Chen, Mike Zheng Shou
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments to demonstrate the effectiveness of VIDEOLLM-MOD, showing its state-of-the-art results on multiple benchmarks, including narration, forecasting, and summarization tasks in COIN, Ego4D, and Ego-Exo4D datasets. |
| Researcher Affiliation | Collaboration | 1 University of Science and Technology of China & State Key Laboratory of Cognitive Intelligence; 2 Show Lab, National University of Singapore; 3 Xiaohongshu Inc.; 4 Institute for Infocomm Research, A*STAR |
| Pseudocode | No | The paper includes a model architecture diagram (Figure 6) and mathematical formulas for FLOPs, but no structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and checkpoints will be made available at github.com/showlab/VideoLLM-online. |
| Open Datasets | Yes | We validate the effectiveness of our proposed VIDEOLLM-MOD in both online and offline settings, including the egocentric video datasets Ego4D [27] and Ego-Exo4D [28], as well as the instructional video dataset COIN [71]. |
| Dataset Splits | Yes | Ego4D Narration Stream Benchmark: Following VideoLLM-online [9], we utilize the dense Ego4D timestamp-narrations to create a streaming set, aiming to generate timely narrations similar to those produced by Ego4D human annotators [27]. ... We use the standard Ego4D v2 splits... |
| Hardware Specification | Yes | We trained all models on 8 NVIDIA A100 GPUs. ... Calculated via the DeepSpeed FLOPs Profiler, processing 600 frames with (1+3×3) patches on a single NVIDIA A100 GPU. |
| Software Dependencies | No | The paper mentions specific models like 'SigLIP-L/16', 'Meta-Llama-3-8B-Instruct', and 'LoRA', but does not provide version numbers for underlying software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | For a fair comparison, we trained the models on the Ego4D narration benchmark for 2 epochs with a learning rate of 2×10⁻⁴. For the Ego4D LTA benchmark, Ego-Exo4D fine-grained keystep recognition benchmark, and COIN benchmark, we trained the models for 6, 10, and 5 epochs with learning rates of 3×10⁻⁴, 2×10⁻⁴, and 1×10⁻⁴, respectively. During training, we set the batch size to 64 and the streaming loss weight σ to 1.0 by default. As a trade-off between computation cost and performance, we insert LayerExpert every other layer and set the keep ratio r to 0.2 as the default setting. |
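
The FLOPs figure in the hardware row was reportedly obtained with the DeepSpeed FLOPs Profiler on 600 frames of (1+3×3) patches. The snippet below is a minimal sketch of how such a measurement could be reproduced with `deepspeed.profiling.flops_profiler.get_model_profile`; the backbone checkpoint, sequence construction, and token counts are assumptions for illustration, not the authors' profiling script.

```python
import torch
from transformers import AutoModelForCausalLM
from deepspeed.profiling.flops_profiler import get_model_profile

# Assumed backbone: the Meta-Llama-3-8B-Instruct model named in the paper.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    torch_dtype=torch.bfloat16,
).cuda()

# 600 frames x (1 + 3x3) = 10 vision tokens per frame, plus a small text
# prompt; the exact sequence construction here is illustrative only.
seq_len = 600 * (1 + 3 * 3) + 32
input_ids = torch.randint(0, model.config.vocab_size, (1, seq_len), device="cuda")

flops, macs, params = get_model_profile(
    model=model,
    kwargs={"input_ids": input_ids},  # forwarded as model(**kwargs)
    print_profile=True,               # print a per-module breakdown
    as_string=True,
)
print(flops, macs, params)
```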
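
For quick reference, the hyperparameters quoted in the experiment-setup row can be collected into a single configuration. The dictionary below only paraphrases those reported values; the key names are my own and do not correspond to a file shipped with the code.

```python
# Training hyperparameters as reported in the paper (key names are assumptions).
training_config = {
    "gpus": 8,                         # 8x NVIDIA A100
    "batch_size": 64,
    "streaming_loss_weight": 1.0,      # sigma in the paper
    "layer_expert_every_n_layers": 2,  # LayerExpert inserted every other layer
    "keep_ratio": 0.2,                 # default fraction of vision tokens kept
    "benchmarks": {
        "ego4d_narration":  {"epochs": 2,  "lr": 2e-4},
        "ego4d_lta":        {"epochs": 6,  "lr": 3e-4},
        "egoexo4d_keystep": {"epochs": 10, "lr": 2e-4},
        "coin":             {"epochs": 5,  "lr": 1e-4},
    },
}
```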
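
The default keep ratio r = 0.2 together with the every-other-layer LayerExpert placement describes a mixture-of-depths-style router over vision tokens: a lightweight scorer picks the top-r fraction of vision tokens to run through the expensive layer, while the rest skip it. The following is a generic sketch of that idea under my own assumptions; it is not the authors' LayerExpert implementation.

```python
import torch
import torch.nn as nn

class TopRVisionRouter(nn.Module):
    """Generic top-r vision-token router (illustrative; not the paper's LayerExpert)."""

    def __init__(self, hidden_dim: int, keep_ratio: float = 0.2):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)  # per-token routing score
        self.keep_ratio = keep_ratio

    def forward(self, vision_tokens: torch.Tensor, block: nn.Module) -> torch.Tensor:
        # vision_tokens: (batch, num_tokens, hidden_dim)
        b, n, d = vision_tokens.shape
        k = max(1, int(n * self.keep_ratio))

        scores = self.scorer(vision_tokens).squeeze(-1)       # (b, n)
        top_idx = scores.topk(k, dim=-1).indices              # (b, k)
        gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, d)  # (b, k, d)
        selected = vision_tokens.gather(1, gather_idx)        # (b, k, d)

        # Only the selected tokens pay for the expensive block; gating by the
        # router score keeps the selection differentiable w.r.t. the scorer.
        gate = torch.sigmoid(scores.gather(1, top_idx)).unsqueeze(-1)
        updated = selected + gate * block(selected)           # residual update

        out = vision_tokens.clone()                           # skipped tokens pass through
        out.scatter_(1, gather_idx, updated)
        return out

# Usage: roughly 600 frames x 10 tokens each; only 20% are routed through `block`.
x = torch.randn(1, 6000, 256)
block = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, 256))
y = TopRVisionRouter(hidden_dim=256, keep_ratio=0.2)(x, block)
```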