Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
SeerAttention: Self-distilled Attention Gating for Efficient Long-context Prefilling
Authors: Yizhao Gao, Zhichen Zeng, DaYou Du, Shijie Cao, Peiyuan Zhou, Jiaxing Qi, Junjie Lai, Hayden So, Ting Cao, Fan Yang, Mao Yang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation results demonstrate that SeerAttention achieves better model accuracy and lower latency for long-context pre-filling compared to prior methods. Code is available at: https://github.com/microsoft/SeerAttention. 1 Introduction Attention is a fundamental mechanism in transformer-based LLMs [51]. Despite its effectiveness, the quadratic complexity of attention demands substantial computation and memory resources, limiting the scalability and efficiency of LLMs, especially for long-context windows. This challenge has become an active research topic in the community. One potential solution is to replace the quadratic attention with cheaper architectures like linear attention or recurrent networks [30, 20, 40, 47] with subquadratic complexity. While these approaches are more efficient, the majority of state-of-the-art large language models (LLMs) continue to use full attention to achieve better performance. ... 4 Experiments In this section, we evaluate both the accuracy and efficiency of SeerAttention. In our current experiments, block-size B for the AttnGate and sparse kernel is fixed at 64 and AttnGate solely applies in the prefill stage. |
| Researcher Affiliation | Collaboration | Yizhao Gao1 Zhichen Zeng2 Dayou Du3 Shijie Cao4 Peiyuan Zhou5 Jiaxing Qi5 Junjie Lai5 Hayden Kwok-Hay So1 Ting Cao6 Fan Yang4 Mao Yang4 1University of Hong Kong 2University of Washington 3University of Edinburgh 4Microsoft Research 5NVIDIA 6Tsinghua University |
| Pseudocode | Yes | Figure 8: Efficient FlashAttention kernel with pooling of attention map. Pseudo Code of Customized Flash-Attn with Max Pooled Attn Map. Input: Q, K, V; Output: O, A. For i from 1 to Tr: Load Qi; for j from 1 to Tc: Load Kj, Vj; Compute Sij = dot(Qi, Kj), rij = rowmax(Sij); Store rij; Update mij = max(mi(j-1), rij), lij and Oij. Compute final li, mi and Oi. For j from 1 to Tc: Load and Rescale aij = exp(rij - mi)/li; Compute and Store Aij = colmax(aij). Return O, A. |
| Open Source Code | Yes | Code is available at: https://github.com/microsoft/SeerAttention. |
| Open Datasets | Yes | We use the RedPajama [11] dataset for AttnGate distillation, which are chunked into 64k with BOS and EOS tokens. Our training employs a learning rate of 1e-3 with cosine decay. We set the global batch size to 16 and conduct training for only 500 steps, leveraging DeepSpeed [43] stage 2 optimization on A100 GPUs. As only AttnGate parameters are learned and updated, the distillation process can be completed within around 40 A100 hours for 7B or 8B models. To prevent the quadratic memory explosion that occurs when saving the intermediate attention map for ground truth generation, we customized a FlashAttention kernel. This kernel directly outputs the 2D max-pooled ground truth on top of the original attention computation. Further details about this kernel can be found in A.1. |
| Dataset Splits | No | The paper mentions using datasets like RedPajama, LongBench, RULER, PG19, HellaSwag, MMLU, ARC-challenge, and GSM8K. For RedPajama, it states it was "chunked into 64k with BOS and EOS tokens" for training, but it does not specify explicit train/test/validation splits for the main experiments. For benchmarks like LongBench and RULER, while they have inherent splits, the paper does not detail the specific splits used by the authors for their experiments beyond saying "we follow a similar practice... that only applies sparsity in context rather than question in SeerAttention". |
| Hardware Specification | Yes | All the evaluation were run on A100 GPUs. ...the distillation process can be completed within around 40 A100 hours for 7B or 8B models. ...compared with FlashAttention-2 (full attention) on a single A100 GPU. ...global batch size 16 on AMD MI300x GPUs |
| Software Dependencies | No | Our training employs a learning rate of 1e-3 with cosine decay. We set the global batch size to 16 and conduct training for only 500 steps, leveraging DeepSpeed [43] stage 2 optimization on A100 GPUs. As only AttnGate parameters are learned and updated, the distillation process can be completed within around 40 A100 hours for 7B or 8B models. ...To overcome this, we developed a customized kernel based on Triton [49] that efficiently extracts the 2D-Max Pooled attention map by modifying the FlashAttention kernel while largely preserving its computation flow. ...We evaluate our customized FlashAttention kernel with 2D-Max Pooled attention map for scalable training of SeerAttention by comparing against with PyTorch naïve manual attention implementation and FlashAttention-2. ...employing DeepSpeed ZeRO-2, AdamW optimizer |
| Experiment Setup | Yes | Our training employs a learning rate of 1e-3 with cosine decay. We set the global batch size to 16 and conduct training for only 500 steps, leveraging DeepSpeed [43] stage 2 optimization on A100 GPUs. In our current experiments, block-size B for the AttnGate and sparse kernel is fixed at 64 and AttnGate solely applies in the prefill stage. In this benchmark test, SeerAttention employs a threshold of 2e-3 for all AttnGates. In this experiment, SeerAttention employs a threshold of 5e-4, which allows it to automatically adapt sparsity from approximately 10% for 4k data to around 85% for 128k data. Nevertheless, we evaluate SeerAttention accuracy performance under a very high threshold 3e-2 to achieve high sparsity. |
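
The pseudocode quoted in the Pseudocode row can be illustrated with a small NumPy sketch. This is not the authors' Triton kernel: the function names, the 1/sqrt(d) scaling, the absence of causal masking, and the requirement that the sequence length be a multiple of the block size are all assumptions made here for illustration. The first function materializes the full attention map and max-pools it. The second follows the tiled flow from the quoted pseudocode, storing only per-block row maxima and rescaling them after the final softmax statistics are known.

```python
import numpy as np


def naive_pooled_attention(q, k, v, block):
    """Reference: build the full softmax map, then 2D max-pool it per block."""
    n, d = q.shape
    s = (q @ k.T) / np.sqrt(d)  # scaling is an assumption, not in the quoted text
    p = np.exp(s - s.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    t = n // block
    # View the (n, n) map as (t, block, t, block) and take the max in each tile.
    pooled = p.reshape(t, block, t, block).max(axis=(1, 3))
    return p @ v, pooled


def tiled_pooled_attention(q, k, v, block):
    """Tiled variant: never stores the full map, only row maxima r_ij."""
    n, d = q.shape
    t = n // block
    out = np.zeros((n, v.shape[1]))
    pooled = np.zeros((t, t))
    for i in range(t):
        qi = q[i * block:(i + 1) * block]
        m = np.full(block, -np.inf)      # running row maximum m_i
        l = np.zeros(block)              # running softmax denominator l_i
        acc = np.zeros((block, v.shape[1]))
        r = np.empty((t, block))         # r_ij: row maxima of each score tile
        for j in range(t):
            kj = k[j * block:(j + 1) * block]
            vj = v[j * block:(j + 1) * block]
            s = (qi @ kj.T) / np.sqrt(d)
            r[j] = s.max(axis=1)
            m_new = np.maximum(m, r[j])
            alpha = np.exp(m - m_new)    # rescales previously accumulated terms
            p = np.exp(s - m_new[:, None])
            l = alpha * l + p.sum(axis=1)
            acc = alpha[:, None] * acc + p @ vj
            m = m_new
        out[i * block:(i + 1) * block] = acc / l[:, None]
        # a_ij = exp(r_ij - m_i) / l_i is the largest softmax weight in each row
        # of tile (i, j); the max over rows gives the pooled entry A_ij.
        a = np.exp(r - m[None, :]) / l[None, :]
        pooled[i] = a.max(axis=1)
    return out, pooled
```

Calling both functions on the same random `q`, `k`, `v` should give matching outputs and matching pooled maps up to floating-point tolerance, which is a convenient sanity check for any faster kernel written along the same lines.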