Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization
Authors: Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, Jianfei Chen
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments confirm that our approach incurs negligible end-to-end metrics loss across diverse models, including those for language, image, and video generation. The code is available at https://github.com/thu-ml/SageAttention. ... We validate the effectiveness of SageAttention2 across a diverse set of representative models from language, image, and video generation. Specifically, we conduct experiments on ten models... We compare the speed of SageAttention2 against baselines using headdim=64 and headdim=128, both with and without Causal Mask (Vaswani, 2017). ... Tables 4 and 17 show the average accuracy of different methods with INT4 Q, K and FP8 P, V across all layers of CogVideoX. ... We assessed the end-to-end metrics of various models using SageAttention2 compared to baselines. ... Ablation Study. Overhead of the techniques we proposed. As shown in Table 18, the overheads on kernel speed of per-thread quantization, smoothing Q, and two-level accumulation are 0.35%, 3.7%, and 0% compared to the attention kernel. |
| Researcher Affiliation | Academia | 1Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University 2Institute for Interdisciplinary Information Sciences, Tsinghua University. Correspondence to: Jianfei Chen <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Implementation of Sage Attention2. |
| Open Source Code | Yes | The code is available at https://github.com/thu-ml/SageAttention. |
| Open Datasets | Yes | Llama2 (7B) (Touvron et al., 2023), Llama3.1 (8B) (Dubey et al., 2024), and GLM4 (9B) (GLM et al., 2024) for text2text, CogVideoX (2B), CogVideoX (1.5-5B) (Yang et al., 2025b), HunyuanVideo (Kong et al., 2024), and Mochi (Team, 2024) for text2video, Flux (schnell) (Black Forest Labs, 2023) and Stable-Diffusion3.5 (turbo) (Stability AI, 2023) for text2image, and TIMM (Wightman, 2019) for image classification. ... For details about the datasets and metrics we used, please refer to Appendix A.7. ... Datasets. Text-to-text models are evaluated on four zero-shot tasks: WikiText (Merity et al., 2022) to assess the model's prediction confidence, LAMBADA (Paperno et al., 2016) to evaluate contextual understanding, MMLU (Hendrycks et al., 2021b) for measuring knowledge across various subjects, and LongBench (Bai et al., 2024) for comprehensive assessment of long-context understanding capabilities. Text-to-video models are evaluated using the Open-Sora (Zheng et al., 2024c) prompt sets. Text-to-image models are assessed on MJHQ-30K (Li et al., 2024). TIMM is evaluated on three image datasets: ImageNet (Deng et al., 2009), ImageNet-Sketch (Sketch) (Wang et al., 2019), and ImageNet-Rendition (ImageNet-R) (Hendrycks et al., 2021a). |
| Dataset Splits | Yes | We evaluate Qwen2-Audio (7B) (Chu et al., 2024), a speech-to-text model, on the ASR task using the LibriSpeech (Panayotov et al., 2015) test split and measure its performance with the WER (word error rate) metric. |
| Hardware Specification | Yes | The operations per second (OPS) of SageAttention2 surpass FlashAttention2 and xformers by about 3x and 4.5x. Moreover, SageAttention2 matches the speed of FlashAttention3 (FP8) on Hopper GPUs, while delivering much higher accuracy. ... We offer a high-performance implementation of SageAttention2 on RTX 4090 and L20 GPUs. ... Specifically, for the mma(f32f8f8f32) instruction C = AB + D, where A, B are FP8 matrices and C, D are FP32 matrices, we initialize A, B to zero and vary D to test the data type of the accumulator. When D is initialized with 1 sign bit, 8 exponent bits, and 13 mantissa bits, the value of C exactly matches D. However, when D is initialized with more than 13 mantissa bits, the value of C is equal to D with its least significant 10 mantissa bits zeroed out (i.e., truncated). Consequently, matrix multiplication of P̃V, quantized to FP8, incurs a certain degree of accuracy loss compared to using an FP32 accumulator. ... Table 9 reports the speedup of different attention methods on RTX 3090, RTX 4090, A100, L40, L20, H100, and H20 GPUs. |
| Software Dependencies | No | The paper mentions CUDA implicitly through PTX instructions and refers to CUTLASS (NVIDIA, 2023), but it does not provide specific version numbers for these or for other relevant software libraries (e.g., PyTorch, NumPy) used in its experimental setup. |
| Experiment Setup | Yes | We benchmark kernel speed with a batch size of 4 and 32 attention heads across a variety of sequence lengths. Benchmarks are conducted using head dimensions of 64 and 128, both with and without Causal Mask (Vaswani, 2017). ... For floating-point data types, inputs are drawn from a Gaussian distribution with mean 0 and standard deviation 1, while for integer data types, inputs are uniformly sampled within the representation range: [-128, 127] for INT8 and [-8, 7] for INT4. |
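The Dataset Splits row above reports ASR quality with WER on the LibriSpeech test split. For readers unfamiliar with the metric, the sketch below is a minimal, illustrative word-error-rate computation (word-level edit distance over reference length); it is our own example, not the authors' evaluation code, and the function name is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count,
    computed via a rolling-array word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[j] = edit distance between the ref words seen so far and hyp[:j].
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i  # prev holds the diagonal (old d[j-1])
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(
                d[j] + 1,            # deletion of ref word r
                d[j - 1] + 1,        # insertion of hyp word h
                prev + (r != h),     # substitution (free if words match)
            )
    return d[-1] / len(ref)
```

A perfect transcript scores 0.0; one substituted word in a three-word reference scores 1/3.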
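The Hardware Specification row describes probing the mma(f32f8f8f32) accumulator: it behaves as if only 13 mantissa bits are retained, i.e. the low 10 mantissa bits of an FP32 value are zeroed. That truncation can be mimicked on the CPU with a bit mask. The sketch below is our illustration of the reported behaviour, not the authors' GPU test harness; the function name is ours.

```python
import struct

def truncate_fp32_mantissa(x: float, kept_bits: int = 13) -> float:
    """Zero the least significant (23 - kept_bits) mantissa bits of an FP32
    value, mimicking the reported mma(f32f8f8f32) accumulator behaviour
    (13 mantissa bits kept, the low 10 of FP32's 23 zeroed out)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]   # reinterpret as uint32
    mask = ~((1 << (23 - kept_bits)) - 1) & 0xFFFFFFFF    # clear low mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits & mask))[0]
```

For example, 1 + 2^-13 (mantissa bit 10) survives truncation, while 1 + 2^-20 (mantissa bit 3) collapses to 1.0, matching the paper's observation that D values needing more than 13 mantissa bits come back truncated.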
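The Experiment Setup row specifies the benchmark inputs: N(0, 1) Gaussians for floating-point kernels and uniform integers over the full representable range for INT8 ([-128, 127]) and INT4 ([-8, 7]). A NumPy sketch of that sampling follows; the function name and the Q/K/V tensor layout are our assumptions (the paper fixes batch size 4 and 32 heads but sweeps sequence lengths).

```python
import numpy as np

def make_attention_inputs(dtype: str, batch=4, heads=32, seq_len=1024,
                          head_dim=64, seed=0):
    """Sample one attention input tensor as described in the paper's setup:
    Gaussian N(0, 1) for floating-point types, uniform integers over the
    full representable range for INT8 and INT4."""
    rng = np.random.default_rng(seed)
    shape = (batch, heads, seq_len, head_dim)
    if dtype == "float":
        return rng.standard_normal(shape, dtype=np.float32)
    if dtype == "int8":
        return rng.integers(-128, 128, size=shape, dtype=np.int8)  # high is exclusive
    if dtype == "int4":
        # NumPy has no native 4-bit dtype; hold INT4 values in an int8 array.
        return rng.integers(-8, 8, size=shape, dtype=np.int8)
    raise ValueError(f"unknown dtype: {dtype}")

q_int4 = make_attention_inputs("int4", head_dim=64)
```

Head dimensions of 64 and 128 match the benchmarked configurations; masking (causal vs. none) is a property of the kernel, not of this input sampling.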