Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding

Authors: Jialiang Kang, Han Shu, Wenshuo Li, Yingjie Zhai, Xinghao Chen

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments validate ViSpec, achieving, to our knowledge, the first substantial speedup in VLM speculative decoding. Code is available at https://github.com/KangJialiang/ViSpec.
Researcher Affiliation Collaboration Jialiang Kang1,2 Han Shu2 Wenshuo Li2 Yingjie Zhai2 Xinghao Chen2 1Peking University 2Huawei Noah's Ark Lab EMAIL EMAIL
Pseudocode No The paper describes algorithms and methods but does not present them in a structured pseudocode block or clearly labeled algorithm.
Open Source Code Yes Code is available at https://github.com/KangJialiang/ViSpec.
Open Datasets Yes We train the draft models for both baselines and ViSpec using a two-stage process. Initially, all draft models are trained on the ShareGPT dataset, comprising 68,000 dialogue iterations, to establish a robust text-based foundation. For multimodal training, we fine-tune the baseline draft models (Medusa and EAGLE-2) on 68,000 samples randomly selected from the LLaVA Visual Instruct Pretrain LCS dataset [15], enabling them to process visual inputs. For ViSpec, we augment this dataset with synthetic long assistant responses generated using the target VLM, as described in Sec. 4.3. Tasks. We evaluate performance on eight diverse multimodal benchmarks: ScienceQA (SQA) [24], MM-Vet [35], MME [7], TextVQA [27], COCO Captions (COCO Caps) [4], VizWiz [9], GQA [11], and SEED-Bench [14].
Dataset Splits No The paper mentions using
Hardware Specification No Hardware. All experiments are conducted on a single GPU. Draft models are trained using 8x GPUs.
Software Dependencies No We utilize models from the Hugging Face Transformers library with the PyTorch backend and pre-allocated KV cache. All other methods build upon these models.
Experiment Setup Yes Medusa. We implement a 1-layer, 3-head Medusa model, adhering to its default configuration. For training on both text-only and vision-language datasets, we use a learning rate of 3e-5, a batch size of 8, and the AdamW optimizer. The model is trained for 20 epochs with a 1-epoch warmup and linear learning rate decay. We set a maximum sequence length of 2048 for both dataset types. For inference, we adopt EAGLE-2's draft tree structure, configuring a total of 30 draft tokens, a tree depth of 3, and selecting 8 nodes during the expansion phase across all models and tasks. EAGLE-2. We employ a 1-layer EAGLE-2 model, following its default settings. Training on text-only and vision-language datasets uses a learning rate of 3e-5, a batch size of 8, and the AdamW optimizer. The model is trained for 20 epochs with a 1-epoch warmup and linear learning rate decay, with a maximum sequence length of 2048 for both dataset types. For inference, we use EAGLE-2's draft tree with 30 draft tokens, a tree depth of 3, and 8 nodes selected during expansion, applied uniformly across all models and tasks. ViSpec. We implement a single-layer draft model that mirrors a decoder layer of the target model. For training on text-only and vision-language datasets, we use a learning rate of 3e-6, a batch size of 8, and the AdamW optimizer. The model is trained for 20 epochs with a 1-epoch warmup and linear learning rate decay, supporting a maximum sequence length of 2048 for both dataset types. During inference, we adopt EAGLE-2's draft tree structure, configuring 30 draft tokens, a tree depth of 3, and selecting 8 nodes during expansion, applied consistently across all models and tasks.
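
The reported experiment setup can be summarized as a small configuration sketch. This is an illustration only and is not taken from the ViSpec repository: the class names, field names, and the `lr_at_step` helper are hypothetical. The quoted text also does not say whether warmup and decay are applied per step or per epoch, or whether the rate decays all the way to zero. The sketch assumes a per-step schedule that decays linearly to zero.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DraftTrainConfig:
    """Training hyperparameters as quoted in the Experiment Setup row."""
    name: str
    learning_rate: float
    batch_size: int = 8
    epochs: int = 20
    warmup_epochs: int = 1
    max_seq_len: int = 2048
    optimizer: str = "AdamW"


@dataclass(frozen=True)
class DraftTreeConfig:
    """EAGLE-2 style draft tree settings shared by all three methods."""
    total_draft_tokens: int = 30
    tree_depth: int = 3
    expansion_nodes: int = 8


# Learning rates differ: 3e-5 for the baselines, 3e-6 for ViSpec.
CONFIGS = {
    "Medusa": DraftTrainConfig("Medusa", 3e-5),
    "EAGLE-2": DraftTrainConfig("EAGLE-2", 3e-5),
    "ViSpec": DraftTrainConfig("ViSpec", 3e-6),
}
TREE = DraftTreeConfig()


def lr_at_step(cfg: DraftTrainConfig, step: int, steps_per_epoch: int) -> float:
    """Linear warmup over the warmup epochs, then linear decay.

    Assumption: per-step schedule reaching zero at the final step.
    """
    total = cfg.epochs * steps_per_epoch
    warmup = cfg.warmup_epochs * steps_per_epoch
    if step < warmup:
        return cfg.learning_rate * (step + 1) / warmup
    remaining = max(total - step, 0)
    return cfg.learning_rate * remaining / (total - warmup)


if __name__ == "__main__":
    # steps_per_epoch=100 is an arbitrary value chosen for the demo.
    for name, cfg in CONFIGS.items():
        print(name, [lr_at_step(cfg, s, 100) for s in (0, 99, 1050, 2000)])
```

Keeping the three methods in one table makes the single difference in training settings easy to see: ViSpec uses a learning rate ten times smaller than the two baselines, while the batch size, epoch count, sequence length, and draft tree settings are identical.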