Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference
Authors: Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis A Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, Shanghang Zhang
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that SparseVLM increases the efficiency of various VLMs in a number of image and video understanding tasks. Our code is available at https://github.com/Gumpest/SparseVLMs. Extensive experiments demonstrate that our SparseVLM effectively reduces computational overhead of various VLMs without sacrificing their performance in a wide range of image and video understanding tasks. |
| Researcher Affiliation | Collaboration | *Equal contribution. ¹State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; ²Fudan University; ³EECS, UC Berkeley; ⁴Shanghai Jiao Tong University; ⁵Panasonic Holdings Corporation. Correspondence to: Wenzhao Zheng <EMAIL>, Shanghang Zhang <EMAIL>. |
| Pseudocode | No | The paper describes the methodology in Section 3 and Appendix B with equations and descriptive text, but it does not present any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/Gumpest/SparseVLMs. |
| Open Datasets | Yes | For image-based multimodal evaluation, we conduct experiments on eight widely adopted benchmarks, including GQA (Hudson & Manning, 2019), MMBench (MMB) (Liu et al., 2024c), MME (Fu et al., 2023), POPE (Li et al., 2023b), SQA (Lu et al., 2022), SEED-Bench (SEED) (Li et al., 2024a), VQAText (TextVQA) (Singh et al., 2019), and MMVet (Yu et al., 2024). We test on four common video question answering benchmarks, TGIF-QA (Jang et al., 2017), MSVD-QA (Xu et al., 2017), MSRVTT-QA (Xu et al., 2017), and ActivityNet-QA (Yu et al., 2019). |
| Dataset Splits | Yes | We set 3 vision token count configurations (192, 128, and 64) to check the advantages of SparseVLM comprehensively. When pruning from 576 to 192 tokens... To make a fair comparison, we both preserve 194 vision tokens (90.5% pruning ratio) for FastV (Chen et al., 2024a) and SparseVLM. Specifically, following FastV's (Chen et al., 2024a) setup, we use the first 1000 samples per benchmark and score them using the Video-ChatGPT (Maaz et al., 2024) evaluation tool, acknowledging the characteristic length imbalances in these datasets. |
| Hardware Specification | Yes | All of our experiments are conducted on a single NVIDIA A100-80G GPU. |
| Software Dependencies | Yes | The implementation is carried out in Python 3.10, utilizing PyTorch 2.1.2, CUDA 11.8, and transformers 4.31.0. |
| Experiment Setup | Yes | We set 3 vision token count configurations (192, 128, and 64) to check the advantages of SparseVLM comprehensively. For LLaVA-1.5-7/13B, Mini-Gemini (MGM), and Qwen-VL, we follow the same inference setting as the original paper as it is publicly available. For video understanding tasks, we adopt the same inference setup as the original Video-LLaVA code base. |
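
The token budgets quoted in the Dataset Splits and Experiment Setup rows can be converted into pruning ratios with simple arithmetic. The sketch below is illustrative only and is not taken from the SparseVLM code base: the constant and function names are hypothetical, and it assumes the 576-token baseline quoted in the table is the starting point for all three image budgets (192, 128, and 64).

```python
# Illustrative arithmetic only; names here are hypothetical, not from the paper's code.
# Assumption: all three image budgets are pruned from the 576-token baseline
# quoted in the table.
ORIGINAL_TOKENS = 576
RETAINED_BUDGETS = (192, 128, 64)


def pruning_ratio(original: int, retained: int) -> float:
    """Return the fraction of vision tokens removed when keeping `retained` of `original`."""
    if original <= 0 or not 0 <= retained <= original:
        raise ValueError("expected 0 <= retained <= original and original > 0")
    return 1.0 - retained / original


def summarize(original: int = ORIGINAL_TOKENS, budgets=RETAINED_BUDGETS) -> dict:
    """Map each retained-token budget to its pruning ratio in percent (one decimal)."""
    return {kept: round(100 * pruning_ratio(original, kept), 1) for kept in budgets}


if __name__ == "__main__":
    for kept, percent in summarize().items():
        print(f"{ORIGINAL_TOKENS} -> {kept} tokens: {percent}% pruned")
```

Under that assumption, the three budgets correspond to roughly 66.7%, 77.8%, and 88.9% of vision tokens being pruned. The 194-token video setting is not included because the table does not state its baseline token count.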