Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Faster Video Diffusion with Trainable Sparse Attention
Authors: Peiyuan Zhang, Yongqi Chen, Haofeng Huang, Will Lin, Zhengzhong Liu, Ion Stoica, Eric P Xing, Hao Helen Zhang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform a large sweep of ablation studies and scaling-law experiments by pretraining DiTs from 60M to 1.4B parameters. VSA reaches a Pareto point that cuts training FLOPS by 2.53× with no drop in diffusion loss. Retrofitting the open-source Wan2.1-1.3B model speeds up attention time by 6× and lowers end-to-end generation time from 31s to 18s with comparable quality, while for the 14B model, end-to-end generation time is reduced from 1274s to 576s. |
| Researcher Affiliation | Academia | Peiyuan Zhang1 Yongqi Chen1 Haofeng Huang1 Will Lin1 Zhengzhong Liu2 Ion Stoica3 Eric P. Xing2 Hao Zhang1 1UC San Diego 2MBZUAI 3UC Berkeley |
| Pseudocode | Yes | Appendix B Pseudocode of VSA We provide a pseudocode in a pytorch-like API for easier understanding of VSA. |
| Open Source Code | Yes | Code is available at https://github.com/hao-ai-lab/FastVideo. |
| Open Datasets | Yes | Our experiments are based on the Wan2.1 model architecture, a state-of-the-art open-source video DiT. Unless otherwise specified, we train models with 120M parameters from scratch for 4.5 × 10^20 FLOPS using video latents of shape (16, 32, 32) from the Vchitect-T2V-Dataverse dataset [10]. |
| Dataset Splits | No | In the finetuning experiments for Wan-1.3B, we trained on 80,000 synthetically generated videos from Wan-14B, each with a resolution of 448 × 832 and 61 frames. In the finetuning experiments for Wan-14B, we set the final sparsity to 0.9 and trained on 200,000 synthetic videos from Wan-14B, each with a resolution of 768 × 1280 and 77 frames. |
| Hardware Specification | Yes | Each ablation job takes around 10 hours on 64 Nvidia H200 GPU. Each model was trained with compute budgets up to 4 × 10^21 FLOPS on 128 H200 GPUs with sequence length of 16K. With VSA, the DiT inference time of Wan-1.3B drops from 31s (full attention with torch compile) to 18s. This integration speeds up attention time by 6x and reduces end-to-end inference latency from 31s to 18s (1.7x) on H100. |
| Software Dependencies | No | The paper mentions several tools and frameworks used, such as FlashAttention [5], FlashAttention-3 [31], ThunderKittens [32], and torch compile, and notes that the pseudocode is in a pytorch-like API. It also uses UMT5-XXL [4] as the text encoder. However, it does not provide specific version numbers for these key software components (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | To establish a strong baseline, we perform a grid search over batch sizes {512, 1024, 2048} and learning rates {5 × 10^-5, 1 × 10^-4, 2 × 10^-4, 6 × 10^-4}. The best hyperparameters is used for all ablation variants. Full training hyperparameters are provided in Table 2b. Table 2b: Learning Rate 6e-4, LR Scheduler Constant, Warmup Steps 100, Batch Size 1024, Weight Decay 1e-2, AdamW Betas (0.9, 0.95), Objective Flow Matching [24, 22], Timestep Sampler LogitNormal(0.0, 1.0) [9]. In the finetuning experiments for Wan-1.3B, we set the per-GPU batch size to 1, applied a gradient accumulation of 2, and used a learning rate of 1e-5. |
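
The setup quoted in the Experiment Setup and Hardware Specification rows can be gathered into a small script for anyone attempting a reproduction. The following is a minimal sketch, not the authors' configuration format: the class name `PretrainConfig`, its field names, and the helper functions `baseline_grid` and `speedup` are hypothetical and chosen only for illustration. The values themselves are the ones quoted from Table 2b and the reported timings.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class PretrainConfig:
    """Values quoted from Table 2b; the field names are hypothetical."""
    learning_rate: float = 6e-4
    lr_scheduler: str = "constant"
    warmup_steps: int = 100
    batch_size: int = 1024
    weight_decay: float = 1e-2
    adamw_betas: tuple = (0.9, 0.95)
    objective: str = "flow_matching"
    timestep_sampler: str = "logit_normal(0.0, 1.0)"


# Grid reported for the full-attention baseline search.
BATCH_SIZES = (512, 1024, 2048)
LEARNING_RATES = (5e-5, 1e-4, 2e-4, 6e-4)


def baseline_grid():
    """Return every (batch_size, learning_rate) pair in the reported grid."""
    return list(product(BATCH_SIZES, LEARNING_RATES))


def speedup(before_s, after_s):
    """Ratio of the baseline latency to the accelerated latency."""
    return before_s / after_s


if __name__ == "__main__":
    cfg = PretrainConfig()
    grid = baseline_grid()
    print(f"grid size: {len(grid)}")  # 3 batch sizes x 4 learning rates = 12
    print(f"chosen point in grid: {(cfg.batch_size, cfg.learning_rate) in grid}")
    # End-to-end timings quoted in the table (seconds).
    print(f"Wan-1.3B speedup: {speedup(31, 18):.1f}x")
    print(f"Wan-14B speedup: {speedup(1274, 576):.1f}x")
```

The selected configuration (batch size 1024, learning rate 6e-4) is one of the 12 grid points, and the 31s to 18s timing reproduces the quoted 1.7x figure. The 14B ratio (about 2.2x) is derived here from the quoted timings and is not stated in the table.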