Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
VCM: Vision Concept Modeling with Adaptive Vision Token Compression via Instruction Fine-Tuning
Authors: Run Luo, Renke Shan, Longze Chen, Ziqiang Liu, Lu Wang, Min Yang, Xiaobo Xia
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate that VCM remarkably reduces computational costs (e.g., achieving up to 85% fewer FLOPs for LLaVA-1.5-7B), while maintaining strong performance across a series of vision-language tasks. The codebase is available at https://github.com/RainBowLuoCS/VCM. |
| Researcher Affiliation | Academia | (1) Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; (2) University of Chinese Academy of Sciences; (3) Shenzhen University of Advanced Technology; (4) National University of Singapore; (5) MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China |
| Pseudocode | Yes | For better understanding, we provide detailed pseudocodes for the training pipeline of LVCM in Algorithm 1 and for the SM operation in Algorithm 2 respectively (check Appendix E for more details). |
| Open Source Code | Yes | The codebase is available at https://github.com/RainBowLuoCS/VCM. |
| Open Datasets | Yes | Specifically, for visual question answering (VQA) evaluation, we conduct experiments on 11 widely adopted image-based benchmarks, including VQAV2 [21], GQA [22], VizWiz [23], SciQA [24], POPE [25], MME [26], MMBench (MMB) [27], SEED-Image (SEED) [28], MMvet [29], TextVQA (VQAT) [30], and MMStar [31]. Also, RefCOCO [32] is used for the region-level VQA task. Besides, we employ COCO [33] and OV-COCO [34] for zero-shot image classification and open-vocabulary object detection, and use ADE-150 [35] and ADE-847 [35] for open-vocabulary semantic segmentation. To test the video understanding capability of VCM, 4 common video question answering benchmarks, TGIF-QA [36], MSVD-QA [37], MSRVTT-QA [37], and ActivityNet-QA [38], are included. |
| Dataset Splits | Yes | All these evaluation tasks and metrics are listed in Table 5. Table 5: Overall descriptions of the evaluation benchmarks for visual question answering (VQA), zero-shot image classification (ZERO.), open-vocabulary object detection (OV-OD), open-vocabulary semantic segmentation (OV-SS), and video understanding. Rows, given as Dataset: Description; Eval Splits; Metric. VQAv2 [21]: Scene understanding QA; test-dev; VQA Acc [79]. GQA [22]: Scene understanding QA; test-dev; VQA Acc [79]. VizWiz [80]: Scene understanding QA; test-dev; VQA Acc [79]. SciQA [24]: External knowledge QA; val; VQA Acc [79]. POPE [25]: Visual hallucination; val; Acc. MMB [27]: Visual comprehension; dev-en; Acc. RefCOCO [32]: Region-level VQA; val, testA, testB; CIDEr [81]. |
| Hardware Specification | Yes | The experiments are conducted with 8 NVIDIA A100-80G GPUs. |
| Software Dependencies | No | The paper mentions the "LLaVA-1.5" and "LLaVA-NeXT [1]" models, the "AdamW optimizer [41]", and a "PyTorch implementation" (Algorithm 2). However, it does not provide specific version numbers for Python, PyTorch, CUDA, or other libraries. |
| Experiment Setup | Yes | The AdamW optimizer [41] is exploited, with learning rates 5 × 10⁻⁵ and 2 × 10⁻⁵ for aforementioned two stages respectively. The instruction fine-tuning stage is trained with two epochs with a 3% warmup strategy. The experiments are conducted with 8 NVIDIA A100-80G GPUs. As shown in Table 4, all experiments are performed with a batch size of 128. The number of training steps is set to 500 for fast evaluation. |
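
The Experiment Setup row quotes a few numbers (learning rates, a 3% warmup, batch size 128 on 8 GPUs, 500 steps) that are easier to sanity-check once written out. The following is a minimal sketch, not the authors' training code. All function and constant names are hypothetical, the linear warmup shape and the constant rate after warmup are assumptions (the quoted text gives only the 3% figure), applying that warmup to the 500-step run is an assumption, and the even per-GPU split assumes plain data parallelism without gradient accumulation.

```python
# Values quoted in the Experiment Setup row; every name below is illustrative.
STAGE_LEARNING_RATES = (5e-5, 2e-5)  # two stages, in the order the paper lists them
WARMUP_FRACTION = 0.03               # "3% warmup strategy"
GLOBAL_BATCH_SIZE = 128
NUM_GPUS = 8
TRAINING_STEPS = 500                 # the "fast evaluation" setting


def warmup_steps(total_steps: int, fraction: float = WARMUP_FRACTION) -> int:
    """Number of optimizer steps spent warming up (at least one)."""
    return max(1, round(total_steps * fraction))


def learning_rate(step: int, total_steps: int, peak_lr: float) -> float:
    """Learning rate at a 0-indexed step.

    ASSUMPTION: linear warmup to peak_lr, then a constant rate. The quoted
    text does not state the warmup shape or any decay schedule.
    """
    n = warmup_steps(total_steps)
    if step < n:
        return peak_lr * (step + 1) / n
    return peak_lr


def per_gpu_batch_size(global_batch: int = GLOBAL_BATCH_SIZE,
                       gpus: int = NUM_GPUS) -> int:
    """ASSUMPTION: pure data parallelism, no gradient accumulation."""
    if global_batch % gpus != 0:
        raise ValueError("global batch size must divide evenly across GPUs")
    return global_batch // gpus
```

Under these assumptions, a 500-step run warms up for 15 steps and each GPU processes 16 samples per step. The actual schedule and device split should be checked against the released codebase.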