Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
SCOPE: Saliency-Coverage Oriented Token Pruning for Efficient Multimodal LLMs
Authors: Jinhong Deng, Wen Li, Joey Tianyi Zhou, Yang He
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on multiple vision-language understanding benchmarks using the LLaVA-1.5 and LLaVA-Next models. Experimental results demonstrate that our method consistently outperforms prior approaches. To evaluate the effectiveness of our SCOPE, we conduct extensive experiments on a variety of vision-language understanding benchmarks using popular MLLMs, including LLaVA-1.5 [24] and LLaVA-Next [25]. The results demonstrate that our method consistently outperforms prior approaches by a significant margin (see Fig. 1(c)). |
| Researcher Affiliation | Academia | (1) University of Electronic Science and Technology of China (UESTC); (2) Shenzhen Institute for Advanced Study, UESTC; (3) CFAR, Agency for Science, Technology and Research (A\*STAR), Singapore; (4) IHPC, Agency for Science, Technology and Research (A\*STAR), Singapore |
| Pseudocode | Yes | The pseudocode of the proposed pruning method is presented in Algorithm 1. |
| Open Source Code | Yes | Our code is available at https://github.com/kinredon/SCOPE. |
| Open Datasets | Yes | Following prior work [49], we evaluate the effectiveness of the proposed method using a set of widely adopted multimodal benchmarks. Specifically, these include GQA [13], MMBench [27], POPE [22], ScienceQA [29], TextVQA [36], SEEDBench [18], and MMVet [45]. For the video benchmarks, we evaluate the MLLMs on the benchmarks TGIF [15], MSVD [5], MSRVTT [42], and ActivityNet [46]. |
| Dataset Splits | Yes | GQA: Our method is evaluated on the subset of testdev_balanced_instructions, which includes 12,578 samples. MME: We evaluate the performance on the dev split including 4,377 samples. POPE: We evaluate the model's performance on the test split, including 9,000 samples. TextVQA: We evaluate the model's performance on the test split, including 5,000 samples. MMVet: We evaluate the model's performance on the test split, including 218 samples. |
| Hardware Specification | Yes | We conduct the experiments on 4 A100 GPUs. |
| Software Dependencies | No | Our implementation is based on the lmms-evals [48] package. |
| Experiment Setup | Yes | The scaling factor α is set to 1.0 by default. Our implementation is based on the lmms-evals [48] package. We conduct the experiments on 4 A100 GPUs. The inference batch size is set to 1 for all the evaluation results. |