Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

SCOPE: Saliency-Coverage Oriented Token Pruning for Efficient Multimodal LLMs

Authors: Jinhong Deng, Wen Li, Joey Tianyi Zhou, Yang He

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct extensive experiments on multiple vision-language understanding benchmarks using the LLaVA-1.5 and LLaVA-Next models. Experimental results demonstrate that our method consistently outperforms prior approaches. To evaluate the effectiveness of our SCOPE, we conduct extensive experiments on a variety of vision-language understanding benchmarks using popular MLLMs, including LLaVA-1.5 [24] and LLaVA-Next [25]. The results demonstrate that our method consistently outperforms prior approaches by a significant margin (see Fig. 1(c)).
Researcher Affiliation	Academia	(1) University of Electronic Science and Technology of China (UESTC); (2) Shenzhen Institute for Advanced Study, UESTC; (3) CFAR, Agency for Science, Technology and Research (A*STAR), Singapore; (4) IHPC, Agency for Science, Technology and Research (A*STAR), Singapore
Pseudocode	Yes	The pseudocode of the proposed pruning method is presented in Algorithm 1.
Open Source Code	Yes	Our code is available at https://github.com/kinredon/SCOPE.
Open Datasets	Yes	Following prior work [49], we evaluate the effectiveness of the proposed method using a set of widely adopted multimodal benchmarks. Specifically, these include GQA [13], MMBench [27], POPE [22], ScienceQA [29], TextVQA [36], SEEDBench [18], and MMVet [45]. For the video benchmarks, we evaluate the MLLMs on the benchmarks TGIF [15], MSVD [5], MSRVTT [42], and ActivityNet [46].
Dataset Splits	Yes	Our method is evaluated on the subset of testdev_balanced_instructions, which includes 12,578 samples. (GQA) We evaluate the performance on the dev split including 4,377 samples. (MME) We evaluate the model's performance on the test split, including 9,000 samples. (POPE) We evaluate the model's performance on the test split, including 5,000 samples. (TextVQA) We evaluate the model's performance on the test split, including 218 samples. (MMVet)
Hardware Specification	Yes	We conduct the experiments on 4 A100 GPUs.
Software Dependencies	No	Our implementation is based on the lmms-evals [48] package.
Experiment Setup	Yes	The scaling factor α is set to 1.0 by default. Our implementation is based on the lmms-evals [48] package. We conduct the experiments on 4 A100 GPUs. The inference batch size is set to 1 for all the evaluation results.