Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation
Authors: Shiwei Wu, Joya Chen, Kevin Qinghong Lin, Qimeng Wang, Yan Gao, Qianli Xu, Tong Xu, Yao Hu, Enhong Chen, Mike Zheng Shou
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments to demonstrate the effectiveness of VIDEOLLM-MOD, showing its state-of-the-art results on multiple benchmarks, including narration, forecasting, and summarization tasks in COIN, Ego4D, and Ego-Exo4D datasets. |
| Researcher Affiliation | Collaboration | ¹University of Science and Technology of China & State Key Laboratory of Cognitive Intelligence; ²Show Lab, National University of Singapore; ³Xiaohongshu Inc.; ⁴Institute for Infocomm Research, A*STAR |
| Pseudocode | No | The paper includes a model architecture diagram (Figure 6) and mathematical formulas for FLOPs, but no structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and checkpoints will be made available at github.com/showlab/VideoLLM-online. |
| Open Datasets | Yes | We validate the effectiveness of our proposed VIDEOLLM-MOD on both online and offline settings, including egocentric video dataset Ego4D [27] and Ego-Exo4D [28], as well as instructional video dataset COIN [71]. |
| Dataset Splits | Yes | Ego4D Narration Stream Benchmark: Following VideoLLM-online [9], we utilize the dense Ego4D timestamp-narrations to create a streaming set, aiming to generate timely narrations similar to those produced by Ego4D human annotators [27]. ... We use the standard Ego4D v2 splits... |
| Hardware Specification | Yes | We trained all models on 8 NVIDIA A100 GPUs. ... Calculated via DeepSpeed FLOPs Profiler, processing 600 frames with (1+3×3) patches on a single NVIDIA A100 GPU. |
| Software Dependencies | No | The paper mentions specific models like 'SigLIP-L/16', 'Meta-Llama-3-8B-Instruct', and 'LoRA', but does not provide specific version numbers for underlying software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | For a fair comparison, we trained the models on the Ego4D narration benchmark for 2 epochs with a learning rate of 2 × 10⁻⁴. For the Ego4D LTA benchmark, Ego-Exo4D fine-grained keystep recognition benchmark, and COIN benchmark, we trained the models for 6, 10, and 5 epochs with learning rates of 3 × 10⁻⁴, 2 × 10⁻⁴, and 1 × 10⁻⁴, respectively. During training, we set the batch size to 64 and streaming loss weight σ to 1.0 by default. For the trade-off between computation cost and performance, we insert LayerExpert every other layer and set the keep ratio r to 0.2 as the default setting. |
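
The reported experiment setup can be collected into a small configuration sketch. The values (epochs, learning rates, batch size, streaming loss weight, keep ratio, 600 frames, 1+3×3 patches) are taken from the table above; everything else is an assumption for illustration and is not the authors' code: the names `BenchmarkSetup`, `SETUPS`, `layer_expert_layers`, and `kept_vision_tokens` are hypothetical, the choice of which alternating layers receive a LayerExpert is a guess, the 32-layer depth is assumed for the Llama-3-8B backbone, and rounding `r × n` is only one plausible reading of "keep ratio".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkSetup:
    """Hyperparameters as reported in the Experiment Setup row (hypothetical container)."""
    epochs: int
    learning_rate: float
    batch_size: int = 64                 # reported default
    streaming_loss_weight: float = 1.0   # sigma, reported default
    keep_ratio: float = 0.2              # r, reported default
    layer_expert_interval: int = 2       # "every other layer"


# Benchmark keys are illustrative labels, not identifiers from the paper.
SETUPS = {
    "ego4d_narration": BenchmarkSetup(epochs=2, learning_rate=2e-4),
    "ego4d_lta": BenchmarkSetup(epochs=6, learning_rate=3e-4),
    "egoexo4d_keystep": BenchmarkSetup(epochs=10, learning_rate=2e-4),
    "coin": BenchmarkSetup(epochs=5, learning_rate=1e-4),
}


def layer_expert_layers(num_layers: int, interval: int = 2, first: int = 1) -> list[int]:
    """Indices of layers that would get a LayerExpert.

    ASSUMPTION: the starting index `first` is a guess; the source only says
    "every other layer".
    """
    return list(range(first, num_layers, interval))


def kept_vision_tokens(num_vision_tokens: int, keep_ratio: float = 0.2) -> int:
    """Vision tokens processed in a LayerExpert layer.

    ASSUMPTION: the count is round(r * n), with at least one token kept.
    """
    return max(1, round(num_vision_tokens * keep_ratio))


if __name__ == "__main__":
    frames, patches_per_frame = 600, 1 + 3 * 3   # profiling setting from the table
    total = frames * patches_per_frame
    print("vision tokens:", total, "kept:", kept_vision_tokens(total))
    print("LayerExpert layers (assumed 32-layer backbone):", layer_expert_layers(32))
```

Under these assumptions, the profiling setting of 600 frames at 1+3×3 patches yields 6000 vision tokens, of which about 1200 would be processed in each LayerExpert layer at r = 0.2.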