Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

FastVID: Dynamic Density Pruning for Fast Video Large Language Models

Authors: Leqi Shen, Guoqiang Gong, Tao He, Yifeng Zhang, Pengzhang Liu, Sicheng Zhao, Guiguang Ding

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive evaluations show that FastVID achieves state-of-the-art performance across various short- and long-video benchmarks on leading Video LLMs, including LLaVA-OneVision, LLaVA-Video, Qwen2-VL, and Qwen2.5-VL. Notably, on LLaVA-OneVision-7B, FastVID effectively prunes 90.3% of video tokens, reduces FLOPs to 8.3%, and accelerates the LLM prefill stage by 7.1×, while maintaining 98.0% of the original accuracy.
Researcher Affiliation Collaboration (1) School of Software, Tsinghua University; (2) BNRist, Tsinghua University; (3) JD.com; (4) GRG Banking Equipment Co., Ltd.
Pseudocode No The paper describes the methodology, including Dynamic Temporal Segmentation and Density Spatiotemporal Pruning, with mathematical equations and descriptive text, but it does not include a clearly labeled pseudocode or algorithm block.
Open Source Code Yes The code is available at https://github.com/LunarShen/FastVID.
Open Datasets Yes We evaluate our method on several widely used video understanding benchmarks: MVBench [21, 31], LongVideoBench [46], MLVU [58], and VideoMME (wo sub.) [12].
Dataset Splits Yes Specifically, VideoMME is officially divided into short, medium, and long subsets. These benchmarks contain videos of varying durations and complex scenarios, providing a comprehensive evaluation of our method's effectiveness and generalization.
Hardware Specification Yes We conduct all evaluations using LMMs-Eval [54] on A100 GPUs. ... The prefill time, defined as the latency to the first generated token, is measured on Video MME using an A100 GPU.
Software Dependencies No The paper mentions various Video LLMs and benchmarks, and states that "All experiments are conducted using LMMs-Eval [54] for consistency," with footnotes to GitHub repositories. However, it does not explicitly state specific version numbers for general software dependencies like Python, PyTorch, or CUDA.
Experiment Setup Yes Unless otherwise specified, we adopt the hyperparameter setting c = 8, τ = 0.9, d = 0.4, p = 4, β = 0.6 for all experiments. For LLaVA-OneVision, 32 sampled frames generate a 32 × 196 token input to the LLM. We experiment with r ∈ {25%, 20%, 15%, 10%}. For LLaVA-Video, 64 sampled frames generate a 64 × 169 token input. We experiment with r ∈ {25%, 10%}. For the Qwen-VL series, which samples up to 768 frames, we discard highly redundant frames and set τ to its optimal value.
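
The token counts in the experiment setup can be worked through with a small sketch. It is illustrative only and is not taken from the FastVID code base. It assumes that r is the fraction of video tokens kept after pruning (the summary above does not define r) and that the budget is rounded down; the paper's exact rounding may differ, which could explain why the reported pruning rate is 90.3% rather than exactly 90%. The names VideoTokenBudget, DEFAULT_HYPERPARAMS, ONEVISION, and LLAVA_VIDEO are hypothetical.

```python
from dataclasses import dataclass

# Default hyperparameters as quoted in the experiment setup.
# Their roles are not described in this summary, so they are stored as-is.
DEFAULT_HYPERPARAMS = {"c": 8, "tau": 0.9, "d": 0.4, "p": 4, "beta": 0.6}


@dataclass(frozen=True)
class VideoTokenBudget:
    """Video token count fed to the LLM for one sampled clip."""

    frames: int
    tokens_per_frame: int

    @property
    def total(self) -> int:
        # Total video tokens before pruning: frames x tokens per frame.
        return self.frames * self.tokens_per_frame

    def retained(self, r_percent: int) -> int:
        # Assumption: r is the kept fraction, rounded down to a whole token.
        # Integer arithmetic avoids floating-point rounding surprises.
        return self.total * r_percent // 100


# Settings quoted in the setup: 32 x 196 and 64 x 169 token inputs.
ONEVISION = VideoTokenBudget(frames=32, tokens_per_frame=196)
LLAVA_VIDEO = VideoTokenBudget(frames=64, tokens_per_frame=169)

if __name__ == "__main__":
    for name, budget, ratios in [
        ("LLaVA-OneVision", ONEVISION, (25, 20, 15, 10)),
        ("LLaVA-Video", LLAVA_VIDEO, (25, 10)),
    ]:
        print(f"{name}: {budget.total} video tokens before pruning")
        for r in ratios:
            print(f"  r = {r}% -> about {budget.retained(r)} tokens kept")
```

Under these assumptions, LLaVA-OneVision starts from 6272 video tokens and keeps about 627 at r = 10%, while LLaVA-Video starts from 10816 and keeps about 1081.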