FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Authors: Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, Christopher Ré

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically validate that FLASHATTENTION speeds up model training and improves model quality by modeling longer context. We also benchmark the runtime and memory footprint of FLASHATTENTION and block-sparse FLASHATTENTION compared to prior attention implementations. (Section 4: Experiments)
Researcher Affiliation | Academia | Department of Computer Science, Stanford University; Department of Computer Science and Engineering, University at Buffalo, SUNY
Pseudocode | Yes | Algorithm 0: Standard Attention Implementation (see the sketch after this table)
Open Source Code | Yes | We open-source FLASHATTENTION to make it easier to build on this primitive. FLASHATTENTION code is available at https://github.com/HazyResearch/flash-attention (a usage sketch follows the table)
Open Datasets | Yes | We train BERT-large (seq. length 512) ... GPT-2 (seq. length 1K) on the large OpenWebtext dataset [34] ... long-range arena (LRA [83]) benchmark ... MIMIC-III [49] and ECtHR [6, 7] datasets.
Dataset Splits | Yes | Appendix E includes plots of the validation perplexity throughout training, confirming that FLASHATTENTION is as numerically stable as the baselines and produces the same training / validation curves.
Hardware Specification | Yes | on one A100 GPU with 40 GB HBM
Software Dependencies | No | The paper mentions 'PyTorch' and states 'Our implementation uses Apex's FMHA code (https://github.com/NVIDIA/apex/tree/master/apex/contrib/csrc/fmha) as a starting point,' but it does not specify version numbers for any software dependencies such as PyTorch, CUDA, or Apex itself.
Experiment Setup | No | The paper refers to 'Additional experiment details are in Appendix E.' and, for LRA, 'We follow the implementation and experimental setting in Tay et al. [83] and Xiong et al. [94].' However, it does not provide specific hyperparameter values or system-level training settings in the main body of the paper.
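For context, the Algorithm 0 referenced in the Pseudocode row is the standard attention baseline: it computes S = QK^T, P = softmax(S), and O = PV, materializing the full N x N score and probability matrices in GPU HBM. Below is a minimal PyTorch sketch of that baseline (single head, unbatched, with the usual 1/sqrt(d) scaling folded in; the function name is ours, not the paper's):

```python
import math
import torch

def standard_attention(Q, K, V):
    """Algorithm 0-style baseline: materializes the full N x N score and
    softmax matrices in GPU memory (the HBM traffic that FlashAttention's
    tiling avoids). Q, K, V: (N, d) tensors for a single attention head."""
    d = Q.shape[-1]
    S = Q @ K.T / math.sqrt(d)     # (N, N) scores, written out to HBM
    P = torch.softmax(S, dim=-1)   # (N, N) row-wise softmax, read/written again
    O = P @ V                      # (N, d) attention output
    return O
```

Because S and P grow quadratically in sequence length N, this baseline's memory footprint dominates at long contexts, which is the bottleneck the paper's IO-aware kernel targets.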
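The open-sourced repository packages the fused kernel as a Python extension. Its entry points have changed across releases, so the call below uses flash_attn_func as exported by recent versions of the flash_attn package; treat it as an illustrative sketch rather than the interface described in the 2022 paper.

```python
import torch
from flash_attn import flash_attn_func  # entry point in recent flash_attn releases; older versions differ

# Inputs are (batch, seqlen, nheads, headdim); the kernels require fp16/bf16 tensors on a CUDA GPU.
q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)

# Exact (not approximate) attention computed tile-by-tile in on-chip SRAM,
# without materializing the (seqlen x seqlen) score matrix in HBM.
out = flash_attn_func(q, k, v, dropout_p=0.0, causal=True)  # -> (2, 1024, 8, 64)
```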