Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

ASDSV: Multimodal Generation Made Efficient with Approximate Speculative Diffusion and Speculative Verification

Authors: Kaijun Zhou, Xingyu Yan, Xingda Wei, Xijun Li, Jinyu Gu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments demonstrate that ASDSV achieves up to 1.77×-3.01× speedup in model inference with a minimal 0.3%-0.4% drop in VBench score, showcasing its effectiveness in accelerating multimodal diffusion models without significant quality degradation.
Researcher Affiliation	Academia	Kaijun Zhou, Xingyu Yan, Xingda Wei, Xijun Li, Jinyu Gu School of Computer Science, Shanghai Jiao Tong University
Pseudocode	No	Workflow. Figure 3 illustrates the workflow of a denoising process with ASDSV. The initial N steps are generated by the target model without any speculative steps, where the input for each step consists of the timestep embedding at the step and the output image of the previous step. After that, the repeated Speculative Diffusion and Speculative Verification process starts.
Open Source Code	No	The code will be publicly available once the acceptance of the paper.
Open Datasets	Yes	Each comparison method generates 10k images using the COCO Captions 2014 dataset [5]. For video generation with Wan2.1, like prior studies [24], we evaluate the generation quality using VBench Score [16] to evaluate the visual quality of the generated videos, LPIPS, PSNR and Structural Similarity Index Measure (SSIM) [36] to evaluate the similarity between the generated videos and the ground truth. Each comparison method generates 2k videos using the VBench dataset [16].
Dataset Splits	No	Each comparison method generates 10k images using the COCO Captions 2014 dataset [5]. Each comparison method generates 2k videos using the VBench dataset [16].
Hardware Specification	Yes	We measure the latency per sample on a single NVIDIA A800 GPU using PyTorch 2.6.0 and CUDA 12.4.
Software Dependencies	Yes	We measure the latency per sample on a single NVIDIA A800 GPU using PyTorch 2.6.0 and CUDA 12.4.
Experiment Setup	Yes	For both image and video generation, we set the total number of denoising steps to 50. For text-to-image generation, we use Flux-SVD [21] as the draft model. For ASDSV-slow, we set γ1 = 3, γ2 = 9, and warmup ratio to 15%. For ASDSV-fast, we use γ1 = γ2 = 9. We set initial steps N to 8% of total steps for both variants and a verification threshold (δ) of 0.02. For text-to-video generation, we use Wan2.1-1.3B as the draft model. For ASDSV-slow, we set γ1 = 2, γ2 = 9, and warmup ratio to 25%. For ASDSV-fast, we use γ1 = γ2 = 9. We set initial steps N to 10% of total steps for the fast variant and 15% for the slow variant with (δ) 0.2.
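
The hyperparameters quoted in the Experiment Setup row can be collected into a small lookup table, which may help anyone attempting to reproduce the four reported settings. The following is a minimal sketch only: `ASDSVConfig`, `CONFIGS`, and `get_config` are hypothetical names chosen for illustration and are not taken from the paper or its (unreleased) code. It assumes that δ = 0.2 applies to both text-to-video variants, and it leaves the warmup ratio unset for the fast variants because the quoted text does not state one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ASDSVConfig:
    """Hypothetical container for the hyperparameters quoted in the table."""
    task: str                   # "text-to-image" or "text-to-video"
    variant: str                # "slow" or "fast"
    draft_model: str            # draft model name as quoted
    gamma1: int                 # γ1
    gamma2: int                 # γ2
    initial_steps_ratio: float  # initial steps N as a fraction of total steps
    delta: float                # verification threshold δ
    warmup_ratio: Optional[float] = None  # not stated for the fast variants
    total_steps: int = 50       # total denoising steps (image and video)


CONFIGS = [
    ASDSVConfig("text-to-image", "slow", "Flux-SVD", 3, 9, 0.08, 0.02, warmup_ratio=0.15),
    ASDSVConfig("text-to-image", "fast", "Flux-SVD", 9, 9, 0.08, 0.02),
    # Assumption: δ = 0.2 is read as applying to both video variants.
    ASDSVConfig("text-to-video", "slow", "Wan2.1-1.3B", 2, 9, 0.15, 0.2, warmup_ratio=0.25),
    ASDSVConfig("text-to-video", "fast", "Wan2.1-1.3B", 9, 9, 0.10, 0.2),
]


def get_config(task: str, variant: str) -> ASDSVConfig:
    """Return the quoted setting for a task/variant pair, or raise KeyError."""
    for cfg in CONFIGS:
        if cfg.task == task and cfg.variant == variant:
            return cfg
    raise KeyError(f"no quoted setting for {task!r} / {variant!r}")
```

The sketch stores the initial steps N as a ratio rather than a step count, since the quoted text does not say how fractional values (for example 15% of 50 steps) are rounded.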