Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization

Authors: Kaiyuan Li, Xiaoyue Chen, Chen Gao, Yong Li, Xinlei Chen

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments across various LVLMs demonstrate the broad effectiveness of our approach on multiple benchmarks. Our method achieves a 78% compression rate while preserving 96.7% of the original model's performance on average.
Researcher Affiliation	Academia	(1) Tsinghua Shenzhen International Graduate School; (2) BNRist, Tsinghua University
Pseudocode	No	The paper describes its methods through textual explanations and mathematical equations (e.g., Equations 1 to 7) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures.
Open Source Code	Yes	Our code is available at https://github.com/EmbodiedCity/NeurIPS2025Balanced-Token-Pruning.
Open Datasets	Yes	We conduct comprehensive experiments on standard visual understanding tasks using models of different sizes, model families, and compression ratios. We report the results on GQA, MMB, MME, POPE, SQA and MM-Vet [13, 21, 22, 30, 49, 50]. All experiments are carried out using the LMMs-Eval [3, 24] framework. ... To determine pruning stages, we randomly sample 64 instances from the LLaVA-655k [27, 28, 29] dataset and use the same set across all models and benchmarks, thus avoiding separate calibration for each benchmark.
Dataset Splits	No	The paper mentions sampling 64 instances from the LLaVA-655k dataset for calibration and details pruning stages based on token retention percentages, but it does not specify explicit training, validation, or test splits for the main benchmarks (GQA, MMB, MME, POPE, SQA, MM-Vet) used in the evaluation tables.
Hardware Specification	Yes	All pruning experiments are conducted on 8 NVIDIA A800 GPUs using the Hugging Face Transformers library.
Software Dependencies	No	The paper mentions using the 'Hugging Face Transformers library' and 'Flash Attn module' / 'Flash Attention acceleration' but does not specify version numbers for these software components.
Experiment Setup	Yes	In the early layers, we use a larger λ value to focus more on global information, while in the deeper layers, we use a smaller λ to emphasize local details. More implementation details for different models are provided in Appendix 7.3. ... The λ used for different models are shown below: Table 7: λ settings in different models: llava-v1.5-7b (0.6, 0.8, 1.0); llava-v1.5-13b (0.6, 0.8, 1.0); llava-v1.6-13b (0.4, 0.7, 1.0); qwen-2.5-vl-7b (0.2, 0.5, 0.8, 1.0)
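
For readers who want to reuse the λ values quoted in the Experiment Setup row, the following is a minimal sketch of how they could be stored as a lookup table. It is not taken from the authors' repository. The names `LAMBDA_SETTINGS` and `lambda_for_stage` are hypothetical, and the mapping of each tuple entry to one pruning stage, in the order printed, is an assumption: the quoted text does not say which value belongs to which stage or layer.

```python
# λ values copied from the Table 7 excerpt quoted above.
# ASSUMPTION: each tuple holds one λ per pruning stage, in the order printed.
# The excerpt does not state how values map to stages or layers.
LAMBDA_SETTINGS = {
    "llava-v1.5-7b": (0.6, 0.8, 1.0),
    "llava-v1.5-13b": (0.6, 0.8, 1.0),
    "llava-v1.6-13b": (0.4, 0.7, 1.0),
    "qwen-2.5-vl-7b": (0.2, 0.5, 0.8, 1.0),
}


def lambda_for_stage(model_name: str, stage: int) -> float:
    """Return the λ listed for a 0-based stage index (hypothetical helper)."""
    values = LAMBDA_SETTINGS[model_name]  # raises KeyError for unknown models
    if not 0 <= stage < len(values):
        raise IndexError(
            f"{model_name} lists {len(values)} values; got stage {stage}"
        )
    return values[stage]
```

Under the stated assumption, `lambda_for_stage("llava-v1.6-13b", 1)` returns the second value listed for that model. Note that qwen-2.5-vl-7b lists four values while the LLaVA models list three, so any code that iterates over stages should take the count from the tuple rather than hard-coding it.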