Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

FlashBias: Fast Computation of Attention with Bias

Authors: Haixu Wu, Minghao Guo, Yuezhou Ma, Yuanxu Sun, Jianmin Wang, Wojciech Matusik, Mingsheng Long

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	FlashBias can achieve 1.5× speedup for Pairformer in AlphaFold 3, and over 2× speedup for attention with bias in vision and language models without loss of accuracy. Code is available at this repository: https://github.com/thuml/FlashBias. (...) As shown in Figure 3 and 4, we can find that FlashBias (red lines) consistently outperforms FlashAttention with Bias [9] and FlexAttention [11], demonstrating the effectiveness of our design. (...) Results in Table 3 demonstrate that FlashBias still outperforms FlashAttention and FlexAttention in processing the ALiBi bias term.
Researcher Affiliation	Academia	¹School of Software, Tsinghua University; ²MIT CSAIL
Pseudocode	No	The paper describes its methodology using mathematical equations and textual explanations, for example, Equations (1), (2), and (5), and sections like '3.1 Rethinking FlashAttention Computation' and '3.2 FlashBias'. However, it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code	Yes	Code is available at this repository: https://github.com/thuml/FlashBias.
Open Datasets	Yes	We test FlashBias on the image classification task based on Swin Transformer v2 [24]. Specifically, we adopt the open-source SwinV2-B model. This model contains 24 layers with input resolution as 384 × 384 and window size as 24 × 24, thus, the sequence length of its Window Attention is 576. (...) For comparison, the full training process of AlphaFold 3 will take about 7 days on 128 A100 GPUs. (...) Dataset Train set: weighted PDB_before2109_wopb_nometalc_0925 Test set: recent PDB_1536_sample384_0925 (...) Pangu-Weather [4] is a significant step in adopting Transformers for global weather forecasting. Specifically, its backbone is a 3D Swin Transformer with a hierarchical structure, which contains two different scales. (...) Table 7: Experiments of Pangu-Weather [4] on ERA5 [15].
Dataset Splits	Yes	Dataset Train set: weighted PDB_before2109_wopb_nometalc_0925 Test set: recent PDB_1536_sample384_0925
Hardware Specification	Yes	All experiments were performed in PyTorch 2.5.0 [27] and Triton 3.0.0 [38] on a single A100 GPU.
Software Dependencies	Yes	All experiments were performed in PyTorch 2.5.0 [27] and Triton 3.0.0 [38] on a single A100 GPU.
Experiment Setup	Yes	Setups. To give a clear comparison among FlashBias, vanilla FlashAttention with Bias [9] and the latest FlexAttention [11], we make a comprehensive efficiency evaluation based on a plain Transformer [40], which consists of 8 layers. Each Transformer layer involves a feedforward network with 1024 intermediate hidden channels and attention with 512 hidden channels, 8 heads, as well as a static bias matrix of shape #heads × N × N. All the metrics are recorded with a batch size of 1. (...) We test FlashBias on the image classification task based on Swin Transformer v2 [24]. Specifically, we adopt the open-source SwinV2-B model. This model contains 24 layers with input resolution as 384 × 384 and window size as 24 × 24, thus, the sequence length of its Window Attention is 576. Window Attention in every layer contains a relative position bias with size #heads × 576 × 576, which is set as a learnable model parameter. We attempt to speed up the Window Attention computation with FlashBias. All the efficiency metrics are evaluated under a batch size of 64. (...) Initial learning rate: 0.001; Optimizer: Adam; Training Learning rate decay: every 50 iterations, reduce to the origin's 0.95; Overall steps: 10,000 iterations
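
The figures quoted in the Experiment Setup row can be checked with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical and do not come from the FlashBias repository, the sequence length N = 1024 is an assumed example value (the excerpt leaves N unspecified), and the learning-rate rule assumes that "reduce to the origin's 0.95" means multiplying the current rate by 0.95 once every 50 iterations.

```python
# Illustrative arithmetic for the quoted setup; all names here are hypothetical.


def window_sequence_length(window_size: int) -> int:
    """Number of tokens in one square attention window."""
    return window_size * window_size


def bias_elements(num_heads: int, seq_len: int) -> int:
    """Entries in a dense bias of shape (num_heads, seq_len, seq_len)."""
    return num_heads * seq_len * seq_len


def stepped_learning_rate(step: int, base_lr: float = 1e-3,
                          decay: float = 0.95, every: int = 50) -> float:
    """Assumed schedule: scale the rate by `decay` once per `every` iterations."""
    return base_lr * decay ** (step // every)


if __name__ == "__main__":
    # A 24 x 24 window gives the sequence length of 576 quoted above.
    print(window_sequence_length(24))

    # Dense bias size for 8 heads at an assumed N = 1024.
    print(bias_elements(8, 1024))

    # Learning rate at the start and after the first decay step.
    print(stepped_learning_rate(0), stepped_learning_rate(50))
```

The dense bias grows quadratically with N, which is why its storage and memory traffic become the cost that the paper's speedup claims are about. Whether the authors' decay compounds in exactly this way is not confirmed by the excerpt.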