Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Each Complexity Deserves a Pruning Policy

Authors: Hanshi Wang, Yuhao Xu, Zekun Xu, Jin Gao, Yufan Liu, Weiming Hu, Ke Wang, Zhipeng Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate our framework across a diverse suite of vision language benchmarks, comparing it against state-of-the-art token pruning methods on a single NVIDIA Tesla A100 GPU. Tab. 1 reports results on multi-modal tasks commonly employed in previous token pruning studies. Tab. 2 assesses the generalizability of our work by applying our AutoPrune to other VLMs. Furthermore, we demonstrate the generalizability of AutoPrune to embodied robots for autonomous driving (Tab. 3).
Researcher Affiliation	Collaboration	1 State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), CASIA; 2 School of Artificial Intelligence, University of Chinese Academy of Sciences; 3 AutoLab, School of Artificial Intelligence, Shanghai Jiao Tong University; 4 Anyverse Intelligence; 5 Beijing Key Laboratory of Super Intelligent Security of Multi-Modal Information; 6 School of Information Science and Technology, ShanghaiTech University; 7 KargoBot
Pseudocode	No	The paper describes the method using mathematical formulas and prose, but there is no explicit pseudocode or algorithm block.
Open Source Code	Yes	Code is available at https://github.com/AutoLab-SAI-SJTU/AutoPrune.
Open Datasets	Yes	We evaluate our AutoPrune integrated into LLaVA-1.5-7B [27] on five standard vision-language benchmarks, including MME [46], MMB [47], ScienceQA (SQA) [48], GQA [49], and TextVQA [50]. For equitable comparison, all methods are benchmarked on datasets employed in prior work [19], encompassing VQAV2 [52], GQA [49], TextVQA [50], POPE [53], and MME [46]. This evaluation utilized the Senna model [10] and its associated customized nuScenes dataset.
Dataset Splits	No	The paper uses standard vision-language benchmarks like MME, MMB, SQA, GQA, TextVQA, VQAV2, POPE, and nuScenes, which typically have predefined splits. However, the paper does not explicitly detail the training/test/validation splits used for these datasets within its main text, nor does it refer to specific citations for these splits.
Hardware Specification	Yes	We evaluate our framework across a diverse suite of vision language benchmarks, comparing it against state-of-the-art token pruning methods on a single NVIDIA Tesla A100 GPU.
Software Dependencies	No	The paper does not provide specific version numbers for any software dependencies or libraries used in the implementation of their method or experiments.
Experiment Setup	No	The paper discusses the methodology, pruning ratios, and token budgets (e.g., 89% visual tokens pruned, retaining 64 tokens), and refers to integration with pre-trained models like LLaVA-1.5-7B. However, it does not explicitly provide concrete hyperparameter values such as learning rates, batch sizes, optimizers, or number of training epochs for the experimental setup. It mentions parameters for the logistic function (k0, γ, β, x0) but does not provide their values or how they were tuned.