Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
FlowCut: Rethinking Redundancy via Information Flow for Efficient Vision-Language Models
Authors: Jintao Tong, Wenwei Jin, Pengda Qin, Anqi Li, Yixiong Zou, Yuhong Li, Yuhua Li, Ruixuan Li
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that FlowCut achieves superior results, outperforming SoTA by 1.6% on LLaVA-1.5-7B with 88.9% token reduction, and by 4.3% on LLaVA-NeXT-7B with 94.4% reduction, delivering 3.2× speed-up in the prefilling stage. |
| Researcher Affiliation | Collaboration | ¹School of Computer Science and Technology, Huazhong University of Science and Technology; ²Xiaohongshu Inc.; ³Shanghai Jiao Tong University |
| Pseudocode | Yes | A.5 Pseudocode: Here, we provide the pseudocode of our method. Note that the method is indeed compatible with FlashAttention because it does not require the full N × N attention map. Instead, it only needs the 1 × N attention values of a single CLS token or global token. When using FlashAttention, we can obtain the necessary attention values with a small, separate computation: 1) For models with a CLS token, we separately compute its 1 × N attention map with the patch tokens; 2) For models without a CLS token, we derive a global token (by averaging patch tokens) and then compute its 1 × N attention map. This targeted computation is highly efficient, with a complexity of just O(n), adding virtually no overhead. This allows our method and FlashAttention to be used together effectively. Algorithm 1: Pseudocode for FlowCut |
| Open Source Code | Yes | Our code is available at https://github.com/TungChintao/FlowCut. |
| Open Datasets | Yes | We evaluated our method across fifteen benchmarks: twelve for image understanding tasks and three for video understanding tasks, each assessing different aspects of multimodal intelligence. GQA [20]: The GQA benchmark is structured around three parts: scene graphs, questions, and images. It enriches visual content with comprehensive spatial information and object-level attributes. |
| Dataset Splits | Yes | We evaluated our method across fifteen benchmarks: twelve for image understanding tasks and three for video understanding tasks, each assessing different aspects of multimodal intelligence. GQA [20]: The GQA benchmark is structured around three parts: scene graphs, questions, and images. [...] VQA-V2 [15] evaluates models' visual perception capabilities through open-ended questions about 265,016 real-world scene images. Each question contains 10 human-annotated ground truth answers, enabling thorough assessment of a model's ability to interpret and respond to visual queries. |
| Hardware Specification | Yes | All of our experiments are conducted on Nvidia A800-80G GPU. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers, only mentioning the use of existing models and benchmarks. |
| Experiment Setup | Yes | Implementation Details: Our method can be seamlessly integrated into models in a training-free manner, or with training from scratch, resulting in substantial improvements to inference speed and training efficiency. For LLaVA-1.5, we complete the pruning at the penultimate layer of the vision encoder. For LLaVA-NeXT and Qwen2-VL, which contain a larger number of tokens, our pruning process spans from the vision encoder to the second layer of the LLM: we prune tokens to twice the target number in the vision encoder, then further reduce to the final target number at the second layer of the LLM. All of our experiments are conducted on Nvidia A800-80G GPU. [...] $S^{(l)}_{\text{cum}} = 0.5\, I^{(l)}_{\text{cur}} + 0.5\, S^{(l-1)}_{\text{cum}}$, where $S^{(0)}_{\text{cum}} = I^{(0)}_{\text{cur}}$ at layer 0 |
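
The quoted excerpts describe two small computations: a 1 × N attention row from a single CLS or averaged "global" token, and a cumulative score that averages the current layer's score with the previous cumulative score. The following is a minimal pure-Python sketch of those two steps only, not the authors' implementation. The function names are hypothetical, the raw vectors stand in for projected queries and keys, and the per-layer scores in the usage note are made-up illustrative values.

```python
import math


def attention_row(query, keys):
    """Softmax attention of one query vector over N key vectors (a 1 x N row)."""
    d = len(query)
    # Scaled dot-product logits, one per key.
    logits = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def global_token(patches):
    """Mean of the patch tokens, for models without a CLS token."""
    n = len(patches)
    return [sum(col) / n for col in zip(*patches)]


def update_cumulative(cur, prev=None):
    """S_cum(l) = 0.5 * I_cur(l) + 0.5 * S_cum(l-1); at layer 0, S_cum = I_cur."""
    if prev is None:
        return list(cur)
    return [0.5 * c + 0.5 * p for c, p in zip(cur, prev)]
```

For example, `update_cumulative([0.2, 0.8])` initialises the score at layer 0, and passing that result as `prev` together with the next layer's scores applies the equal-weight average from the quoted formula. Only one query is scored against the N patch tokens, so the cost grows linearly in N, consistent with the O(n) claim in the Pseudocode row.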