Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

DyMU: Dynamic Merging and Virtual Unmerging for Efficient Variable-Length VLMs

Authors: Zhenhailong Wang, Senthil Purushwalkam, Caiming Xiong, Silvio Savarese, Heng Ji, Ran Xu

NeurIPS 2025 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Extensive experiments on image and video understanding tasks, demonstrate that DYMU can reduce the average visual token count by 32%-85% while achieving comparable performance to full-length models, across diverse VLM architectures. |
| Researcher Affiliation | Collaboration | Zhenhailong Wang1*, Senthil Purushwalkam2*, Caiming Xiong2, Silvio Savarese2, Heng Ji1, Ran Xu2. 1 University of Illinois Urbana-Champaign, 2 Salesforce Research |
| Pseudocode | No | The paper describes methods using structured text and mathematical formulations (e.g., Section 3.1 Dynamic Token Merging (DToMe) and Section 3.2 Virtual Token Unmerging (VTU)), but does not include explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Answer: [Yes] Justification: All datasets used in this work are publicly available. We also provide code as part of supplemental material. |
| Open Datasets | Yes | Answer: [Yes] Justification: All datasets used in this work are publicly available. We also provide code as part of supplemental material. ... sampled from the SFT instruction tuning data of LLaVA 1.5 [25] comprising of images from MS-COCO [24], Visual Genome [17], OCR-VQA [32], TextVQA [36] and GQA [16]. |
| Dataset Splits | Yes | For results on LLaVA-1.5 (as in Tables 2 and 3) we leverage the official evaluation code from LLaVA-1.5. The results on MME and LLaVA-Bench for DYMU are averaged across three runs, as we observe a higher variance on these two tasks. ... For results on LLaVA-OneVision (as in Table 4), we leverage VLMEvalKit for getting the evaluation results. |
| Hardware Specification | No | The paper mentions 'optimized GPU kernels' in Section 5, but does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for the experiments. |
| Software Dependencies | No | The paper mentions 'PyTorch' (Section 5) and LLMs like 'Vicuna-7B [7]' and 'Qwen2 [41]', but does not provide specific version numbers for these or other software libraries and dependencies. |
| Experiment Setup | Yes | For DToMe, we find layer-wise thresholds using a diverse dataset of 250k images sampled from the SFT instruction tuning data of LLaVA 1.5 [25] comprising of images from MS-COCO [24], Visual Genome [17], OCR-VQA [32], TextVQA [36] and GQA [16]. ... For each visual encoder in the experiments, including CLIP [34] and SigLIP [47, 1], we find thresholds for three variants of the encoder by choosing different average number of tokens to drop (r_i) in each layer. We represent these variants by -low, -mid, -high corresponding to the expected average number of tokens. ... The results on MME and LLaVA-Bench for DYMU are averaged across three runs, as we observe a higher variance on these two tasks. |
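
The Experiment Setup entry says that layer-wise thresholds are chosen so that a target average number of tokens (r_i) is dropped in each layer over a calibration image set. The sketch below illustrates one way such a threshold could be calibrated for a single layer. It is not the authors' implementation. It assumes each image yields a list of similarity scores for its candidate token pairs, and it picks the score cutoff at which the mean number of pairs at or above the cutoff equals the target. The names `calibrate_threshold`, `merges_per_image`, `scores_per_image`, and `avg_merges` are hypothetical.

```python
def calibrate_threshold(scores_per_image, avg_merges):
    """Return a similarity cutoff for one layer (illustrative assumption).

    scores_per_image: one list of candidate-pair similarity scores per image.
    avg_merges: target mean number of merged pairs per image (plays the role of r_i).
    """
    # Pool every candidate-pair score from the calibration set, highest first.
    pooled = sorted(
        (score for scores in scores_per_image for score in scores),
        reverse=True,
    )
    # Total number of merges allowed across the whole calibration set.
    k = round(avg_merges * len(scores_per_image))
    if k <= 0:
        return float("inf")  # no pair can reach this cutoff, so nothing merges
    if k >= len(pooled):
        return pooled[-1]  # every candidate pair reaches the cutoff
    # The k-th largest pooled score: k pairs lie at or above it (barring ties).
    return pooled[k - 1]


def merges_per_image(scores_per_image, threshold):
    """Count, for each image, the pairs whose score reaches the cutoff."""
    return [sum(score >= threshold for score in scores) for scores in scores_per_image]
```

With the toy input `[[0.9, 0.8, 0.1], [0.7, 0.2, 0.3]]` and `avg_merges=1.5`, the cutoff is 0.7 and the two images merge 2 and 1 pairs respectively. The mean matches the target while the per-image count varies with image content, which is the variable-length behaviour the quoted passage describes.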