Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
FPSAttention: Training-Aware FP8 and Sparsity Co-Design for Fast Video Diffusion
Authors: Akide Liu, Zeyu Zhang, Zhexin Li, Xuehai Bai, Yuanjie Xing, Yizeng Han, Jiasheng Tang, Jichao Wu, Mingyang Yang, Weihua Chen, Jiahao He, Yuanyu He, Fan Wang, Reza Haffari, Bohan Zhuang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Trained on Wan2.1's 1.3B and 14B models and evaluated on the VBench benchmark, FPSAttention achieves a 7.09× kernel speedup for attention operations and a 4.96× end-to-end speedup for video generation compared to the BF16 baseline at 720p resolution without sacrificing generation quality. Project page: https://fps.ziplab.co. Table 1: Efficiency comparison of the BF16 baseline, FP8 quantization, STA sparse attention, and our FPSAttention method on Wan2.1-14B at 720p resolution on an NVIDIA H20 GPU. We report both kernel-level and end-to-end speedups relative to the BF16 baseline. In this section, we conduct a comprehensive ablation study to analyze the effects of key components in FPSAttention. |
| Researcher Affiliation | Collaboration | 1 Monash University; 2 DAMO Academy, Alibaba Group; 3 ZIP Lab, Zhejiang University; 4 Hupan Lab |
| Pseudocode | Yes | Appendix A: FPSAttention Algorithm. Algorithm 1: FPSAttention: Joint Tile-wise FP8 Quantization and Sparse Attention |
| Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We provide code after acceptance. |
| Open Datasets | Yes | We evaluate our method on the public video dataset, VBench [17]. [17] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807-21818, 2024. |
| Dataset Splits | No | The paper mentions using the VBench dataset for evaluation, and also discusses training on a curated high-quality video dataset standardized to 480p resolution, 16fps frame rate, and 5-second duration. However, it does not explicitly provide details about training/validation/test splits for this training data (e.g., exact percentages or sample counts), nor does it specify how the VBench dataset is split for experimental purposes beyond sampling 5 videos per prompt. |
| Hardware Specification | Yes | Trained on Wan2.1's 1.3B and 14B models and evaluated on the VBench benchmark, FPSAttention achieves a 7.09× kernel speedup for attention operations and a 4.96× end-to-end speedup for video generation compared to the BF16 baseline at 720p resolution on an NVIDIA H20 GPU. Appendix B: Additional Implementation Details. Hardware. Experiments utilize a distributed computing cluster with high-performance GPU nodes, each containing 192 CPU cores, 960GB system memory, and 8 NVIDIA H20 GPUs (96GB each). InfiniBand interconnects ensure high-bandwidth inter-node communication for distributed training. |
| Software Dependencies | No | Fused kernels were compiled using Triton to accelerate inference on Hopper GPUs. The quantization schemes and sparsity patterns were applied across attention mechanisms using score_mod and mask_mod functions via FlexAttention [8]. |
| Experiment Setup | Yes | Table 8: Comprehensive hyperparameter configuration for Wan 1.3B and 13B model training and evaluation. The table covers model architecture specifications, training parameters, diffusion scheduler settings, data configuration, and system-level precision settings used in our experiments. Category: Training. Parameters (Wan 1.3B / Wan 13B): Learning Rate 5e-6 / 5e-6; Weight Decay 1e-4 / 1e-4; Gradient Clipping 1.0 / 1.0; Warmup Steps 200 / 200; EMA Decay 0.99 / 0.99; Adam Epsilon 1e-15 / 1e-15. |
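
The Software Dependencies row quotes the paper as applying quantization and sparsity through score_mod and mask_mod functions. As a rough illustration of what such callbacks look like, the sketch below follows the FlexAttention calling convention (`mask_mod(b, h, q_idx, kv_idx)` and `score_mod(score, b, h, q_idx, kv_idx)`) but runs on plain Python integers and floats rather than tensors. The tile size, window width, scale values, and function names are hypothetical placeholders chosen for the example; they are not taken from the paper and do not reproduce its Algorithm 1.

```python
# Illustrative only: TILE, WINDOW, Q_SCALE and K_SCALE are assumed values,
# not settings reported by the FPSAttention authors.
TILE = 4        # hypothetical number of tokens per tile
WINDOW = 1      # hypothetical number of neighbouring tiles kept on each side
Q_SCALE = 0.5   # hypothetical per-tile dequantization scale for queries
K_SCALE = 0.25  # hypothetical per-tile dequantization scale for keys


def tile_local_mask(b, h, q_idx, kv_idx):
    """mask_mod-style predicate: keep a (query, key) pair only when the
    two tokens fall in the same tile or in tiles at most WINDOW apart."""
    return abs(q_idx // TILE - kv_idx // TILE) <= WINDOW


def dequant_score(score, b, h, q_idx, kv_idx):
    """score_mod-style function: rescale a low-precision attention score
    by the product of the query and key scales."""
    return score * Q_SCALE * K_SCALE


def dense_mask(seq_len):
    """Materialize the predicate as a seq_len x seq_len boolean grid."""
    return [
        [tile_local_mask(0, 0, q, k) for k in range(seq_len)]
        for q in range(seq_len)
    ]


if __name__ == "__main__":
    grid = dense_mask(16)
    kept = sum(sum(row) for row in grid)
    print(f"kept {kept} of {16 * 16} attention entries")
```

In an actual PyTorch setup, callbacks of this shape would be handed to FlexAttention and evaluated on index tensors; the plain-integer version here is only meant to make the two roles (which entries are computed versus how each score is adjusted) concrete.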