Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

ASDSV: Multimodal Generation Made Efficient with Approximate Speculative Diffusion and Speculative Verification

Authors: Kaijun Zhou, Xingyu Yan, Xingda Wei, Xijun Li, Jinyu Gu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that ASDSV achieves up to 1.77×-3.01× speedup in model inference with a minimal 0.3%-0.4% drop in VBench score, showcasing its effectiveness in accelerating multimodal diffusion models without significant quality degradation.
Researcher Affiliation | Academia | Kaijun Zhou, Xingyu Yan, Xingda Wei, Xijun Li, Jinyu Gu. School of Computer Science, Shanghai Jiao Tong University
Pseudocode | No | Workflow. Figure 3 illustrates the workflow of a denoising process with ASDSV. The initial N steps are generated by the target model without any speculative steps, where the input for each step consists of the timestep embedding at the step and the output image of the previous step. After that, the repeated Speculative Diffusion and Speculative Verification process starts.
Open Source Code | No | The code will be publicly available once the acceptance of the paper.
Open Datasets | Yes | Each comparison method generates 10k images using the COCO Captions 2014 dataset [5]. For video generation with Wan2.1, like prior studies [24], we evaluate the generation quality using VBench Score [16] to evaluate the visual quality of the generated videos, LPIPS, PSNR and Structural Similarity Index Measure (SSIM) [36] to evaluate the similarity between the generated videos and the ground truth. Each comparison method generates 2k videos using the VBench dataset [16].
Dataset Splits | No | Each comparison method generates 10k images using the COCO Captions 2014 dataset [5]. Each comparison method generates 2k videos using the VBench dataset [16].
Hardware Specification | Yes | We measure the latency per sample on a single NVIDIA A800 GPU using Pytorch 2.6.0 and CUDA 12.4.
Software Dependencies | Yes | We measure the latency per sample on a single NVIDIA A800 GPU using Pytorch 2.6.0 and CUDA 12.4.
Experiment Setup | Yes | For both image and video generation, we set the total number of denoising steps to 50. For text-to-image generation, we use Flux-SVD [21] as the draft model. For ASDSV-slow, we set γ1 = 3, γ2 = 9, and warmup ratio to 15%. For ASDSV-fast, we use γ1 = γ2 = 9. We set initial steps N to 8% of total steps for both variants and a verification threshold (δ) of 0.02. For text-to-video generation, we use Wan2.1-1.3B as the draft model. For ASDSV-slow, we set γ1 = 2, γ2 = 9, and warmup ratio to 25%. For ASDSV-fast, we use γ1 = γ2 = 9. We set initial steps N to 10% of total steps for the fast variant and 15% for the slow variant with (δ) 0.2.
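
The settings quoted in the Experiment Setup entry can be collected into a small lookup table, which makes the four reported configurations easier to compare. The following is a minimal sketch and not the authors' code, which the Open Source Code entry reports as unavailable. The class name `ASDSVConfig`, its field names, and the `initial_steps` helper are hypothetical. Applying δ = 0.2 to both video variants is an assumed reading of the quoted sentence. The source does not say how fractional step counts (for example 15% of 50 steps) are rounded, so the helper returns the unrounded value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ASDSVConfig:
    """Reported ASDSV settings; class and field names are hypothetical."""
    draft_model: str
    gamma1: int
    gamma2: int
    initial_step_percent: int             # N, as a percentage of total steps
    delta: float                          # verification threshold
    warmup_percent: Optional[int] = None  # reported for the slow variants only
    total_steps: int = 50                 # same for image and video generation

    def initial_steps(self) -> float:
        """Unrounded number of initial target-model steps (rounding is unspecified)."""
        return self.total_steps * self.initial_step_percent / 100


CONFIGS = {
    ("text-to-image", "slow"): ASDSVConfig("Flux-SVD", 3, 9, 8, 0.02, warmup_percent=15),
    ("text-to-image", "fast"): ASDSVConfig("Flux-SVD", 9, 9, 8, 0.02),
    # Assumption: delta = 0.2 applies to both video variants.
    ("text-to-video", "slow"): ASDSVConfig("Wan2.1-1.3B", 2, 9, 15, 0.2, warmup_percent=25),
    ("text-to-video", "fast"): ASDSVConfig("Wan2.1-1.3B", 9, 9, 10, 0.2),
}

if __name__ == "__main__":
    # Print the unrounded initial-step count for each reported configuration.
    for (task, variant), cfg in CONFIGS.items():
        print(task, variant, cfg.initial_steps())
```

Storing the ratios as integer percentages keeps the step arithmetic exact and leaves the rounding decision to whoever reproduces the experiments.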