Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
SpargeAttention: Accurate and Training-free Sparse Attention Accelerating Any Model Inference
Authors: Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia Wei, Haocheng Xi, Jun Zhu, Jianfei Chen
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that our method significantly accelerates diverse models, including language, image, and video generation, without sacrificing end-to-end metrics. (Abstract) and 4. Experiment (Section 4 title). |
| Researcher Affiliation | Academia | (1) Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University; (2) Institute for Interdisciplinary Information Sciences, Tsinghua University; (3) EECS, University of California, Berkeley. Correspondence to: Jun Zhu <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Implementation of SpargeAttn. |
| Open Source Code | Yes | The code is available at https://github.com/thu-ml/SpargeAttn. |
| Open Datasets | Yes | The Text-to-text model is evaluated on four zero-shot tasks: WikiText (Merity et al., 2017)... Longbench (Bai et al., 2024)... InfiniteBench (Zhang et al., 2024)... Needle-in-a-Haystack task (Kamradt, 2023)... open-sora (Zheng et al., 2024c) prompt sets. Text-to-image models are assessed on COCO annotations (Lin et al., 2014). |
| Dataset Splits | No | The paper mentions several datasets, including WikiText, Longbench, InfiniteBench, Needle-in-a-Haystack, open-sora prompt sets, and COCO annotations, but does not provide specific training/test/validation splits (percentages, sample counts, or explicit references to standard splits for these datasets). |
| Hardware Specification | Yes | Figure 1: "SpargeAttn can achieve 1.83x speedup on Mochi on L40 GPU, with no video quality loss." (Full attention end-to-end time: 1897 s on L40; SpargeAttn end-to-end time: 1037 s on L40.) Table 2, end-to-end generation latency using SpargeAttn, listed as Original / SageAttn / SpargeAttn: CogvideoX on RTX4090: 87 s / 68 s / 53 s; Mochi on L40: 1897 s / 1544 s / 1037 s; Llama3.1 (24K) on RTX4090: 4.01 s / 3.53 s / 2.6 s; Llama3.1 (128K) on L40: 52 s / 42 s / 29.98 s. |
| Software Dependencies | No | We implement our method using CUDA. (Section 4.1) and SpargeAttn+FA2 means deploying our method on FlashAttention2. (Figure 10 caption). The paper mentions software like CUDA, FlashAttention2, and SageAttention, but does not specify their version numbers. |
| Experiment Setup | Yes | As discussed in Sec. 3.6, we need to determine l1, l2 for models. We use (l1 = 0.08, l2 = 0.09) for Llama3.1, (l1 = 0.05, l2 = 0.06) for CogvideoX and Mochi, and (l1 = 0.07, l2 = 0.08) for Stable-Diffusion3.5 and Flux, (l1 = 0.03, l2 = 0.035) for Open-Sora-Plan. (Section 4.1) |
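
The speedup quoted in the Hardware Specification row can be checked from the Table 2 latencies with simple division. The sketch below is illustrative only: the dictionary layout and the `speedups` helper are assumptions made for this example and are not part of the paper or its code release; the numbers are copied from the table as quoted above.

```python
# End-to-end latencies in seconds, copied from Table 2 as quoted above.
# Keys are (model, GPU); values are (original, SageAttn, SpargeAttn).
# The structure and names here are hypothetical, chosen for illustration.
LATENCY_S = {
    ("CogvideoX", "RTX4090"): (87.0, 68.0, 53.0),
    ("Mochi", "L40"): (1897.0, 1544.0, 1037.0),
    ("Llama3.1 (24K)", "RTX4090"): (4.01, 3.53, 2.6),
    ("Llama3.1 (128K)", "L40"): (52.0, 42.0, 29.98),
}


def speedups(latency):
    """Return the speedup of SpargeAttn over the original model per entry."""
    return {
        key: original / sparge
        for key, (original, _sage, sparge) in latency.items()
    }


if __name__ == "__main__":
    for (model, gpu), factor in speedups(LATENCY_S).items():
        print(f"{model} on {gpu}: {factor:.2f}x")
```

For Mochi on L40 this gives 1897 / 1037, which rounds to 1.83x and agrees with the Figure 1 caption.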