Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization
Authors: Kaiyuan Li, Xiaoyue Chen, Chen Gao, Yong Li, Xinlei Chen
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across various LVLMs demonstrate the broad effectiveness of our approach on multiple benchmarks. Our method achieves a 78% compression rate while preserving 96.7% of the original model's performance on average. |
| Researcher Affiliation | Academia | ¹Tsinghua Shenzhen International Graduate School; ²BNRist, Tsinghua University |
| Pseudocode | No | The paper describes its methods through textual explanations and mathematical equations (e.g., Equations 1 through 7) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures. |
| Open Source Code | Yes | Our code is available at https://github.com/EmbodiedCity/NeurIPS2025Balanced-Token-Pruning. |
| Open Datasets | Yes | We conduct comprehensive experiments on standard visual understanding tasks using models of different sizes, model families, and compression ratios. We report the results on GQA, MMB, MME, POPE, SQA and MM-Vet [13, 21, 22, 30, 49, 50]. All experiments are carried out using the LMMs-Eval [3, 24] framework. ... To determine pruning stages, we randomly sample 64 instances from the LLaVA-655k [27, 28, 29] dataset and use the same set across all models and benchmarks, thus avoiding separate calibration for each benchmark. |
| Dataset Splits | No | The paper mentions sampling 64 instances from the LLaVA-655k dataset for calibration and details pruning stages based on token retention percentages, but it does not specify explicit training, validation, or test splits for the main benchmarks (GQA, MMB, MME, POPE, SQA, MM-Vet) used in the evaluation tables. |
| Hardware Specification | Yes | All pruning experiments are conducted on 8 NVIDIA A800 GPUs using the Hugging Face Transformers library. |
| Software Dependencies | No | The paper mentions using the 'Hugging Face Transformers library' and 'Flash Attn module' / 'Flash Attention acceleration' but does not specify version numbers for these software components. |
| Experiment Setup | Yes | In the early layers, we use a larger λ value to focus more on global information, while in the deeper layers, we use a smaller λ to emphasize local details. More implementation details for different models are provided in Appendix 7.3. ... The λ values used for different models are shown below. Table 7 (λ settings in different models): llava-v1.5-7b: (0.6, 0.8, 1.0); llava-v1.5-13b: (0.6, 0.8, 1.0); llava-v1.6-13b: (0.4, 0.7, 1.0); qwen-2.5-vl-7b: (0.2, 0.5, 0.8, 1.0) |
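
The Research Type row quotes a 78% compression rate. As a rough illustration of what that means in token counts, the sketch below assumes "compression rate" is the fraction of visual tokens removed and applies it to LLaVA-1.5, which encodes a 336x336 image as 24 x 24 = 576 visual tokens. The helper name `retained_tokens` is hypothetical and not part of the authors' code; whether the reported 78% is computed exactly this way is an assumption.

```python
def retained_tokens(num_visual_tokens: int, compression_rate: float) -> int:
    """Number of visual tokens kept after pruning a given fraction.

    Assumption (hypothetical helper): "compression rate" is the fraction of
    visual tokens removed; the kept count is rounded to the nearest integer.
    """
    if not 0.0 <= compression_rate <= 1.0:
        raise ValueError("compression_rate must be in [0, 1]")
    return round(num_visual_tokens * (1.0 - compression_rate))


# LLaVA-1.5: 336 / 14 = 24 patches per side, so 24 * 24 = 576 visual tokens.
kept = retained_tokens(576, 0.78)
print(kept, f"{kept / 576:.1%}")  # prints "127 22.0%"
```

Under this reading, roughly 127 of 576 visual tokens would reach the language model.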
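
The Open Datasets row states that 64 calibration instances are sampled once and reused across all models and benchmarks. One common way to make such a draw repeatable is a dedicated, fixed-seed random generator, sketched below. The function name, the seed, and the pool size are hypothetical placeholders; the row does not say how the authors drew or stored their sample.

```python
import random
from typing import List


def sample_calibration_indices(pool_size: int, k: int = 64, seed: int = 0) -> List[int]:
    """Draw k distinct example indices from a pool, reproducibly.

    A dedicated Random instance with a fixed seed yields the same indices on
    every call, so one calibration set can be reused for every model and
    benchmark instead of being re-drawn per run.
    """
    rng = random.Random(seed)  # isolated from the global random state
    return sorted(rng.sample(range(pool_size), k))


# pool_size is a placeholder; substitute the length of the calibration source set.
calib = sample_calibration_indices(pool_size=10_000, k=64, seed=0)
print(len(calib), calib[:5])
```

Saving the resulting indices (or the seed and pool ordering) alongside the code would let others rebuild the identical calibration set.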
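
The Software Dependencies row is marked "No" because no version numbers are given for Transformers or FlashAttention. A minimal sketch of recording installed versions at run time uses the standard-library `importlib.metadata` module. The distribution names `transformers` and `flash-attn` are assumptions about how these packages are installed, and `installed_versions` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, List, Optional


def installed_versions(distributions: List[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    found: Dict[str, Optional[str]] = {}
    for name in distributions:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Assumed distribution names for Hugging Face Transformers and FlashAttention.
for name, ver in installed_versions(["transformers", "flash-attn"]).items():
    print(f"{name}=={ver}" if ver else f"{name}: not installed")
```

Printing the result in `name==version` form makes it easy to paste into a requirements file.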
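
The Experiment Setup row lists per-model λ tuples from Table 7. A sketch of holding those values as a lookup table is shown below. The λ numbers are copied from the row; the names `LAMBDA_SETTINGS` and `lambda_for_stage` are hypothetical. It is also an assumption that each tuple entry corresponds to one pruning stage; the row does not say which end of a tuple maps to the earliest layers, so `stage` here is only a position in the tuple.

```python
from typing import Dict, Tuple

# λ values per model, copied from Table 7 as quoted in the Experiment Setup row.
LAMBDA_SETTINGS: Dict[str, Tuple[float, ...]] = {
    "llava-v1.5-7b": (0.6, 0.8, 1.0),
    "llava-v1.5-13b": (0.6, 0.8, 1.0),
    "llava-v1.6-13b": (0.4, 0.7, 1.0),
    "qwen-2.5-vl-7b": (0.2, 0.5, 0.8, 1.0),
}


def lambda_for_stage(model: str, stage: int) -> float:
    """Return the λ at a given position in the model's tuple.

    Assumption: one entry per pruning stage; `stage` is the tuple position,
    not a claim about which layers that position covers.
    """
    schedule = LAMBDA_SETTINGS[model]  # KeyError for models not in Table 7
    if not 0 <= stage < len(schedule):
        raise IndexError(f"{model} has {len(schedule)} λ entries, got stage={stage}")
    return schedule[stage]


print(lambda_for_stage("qwen-2.5-vl-7b", 0))  # prints 0.2
```

Note that Qwen2.5-VL-7B has four entries while the LLaVA models have three, so any code consuming these settings should not hard-code the number of stages.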