FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Authors: Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, Christopher Ré
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate that FlashAttention speeds up model training and improves model quality by modeling longer context. We also benchmark the runtime and memory footprint of FlashAttention and block-sparse FlashAttention compared to prior attention implementations. (Section 4: Experiments; a runtime/memory timing sketch appears below the table) |
| Researcher Affiliation | Academia | Department of Computer Science, Stanford University Department of Computer Science and Engineering, University at Buffalo, SUNY |
| Pseudocode | Yes | Algorithm 0: Standard Attention Implementation (a minimal sketch appears below the table) |
| Open Source Code | Yes | We open-source FlashAttention to make it easier to build on this primitive. FlashAttention code is available at https://github.com/HazyResearch/flash-attention |
| Open Datasets | Yes | We train BERT-large (seq. length 512) ... GPT-2 (seq. length 1K) on the large OpenWebText dataset [34]... long-range arena (LRA [83]) benchmark... MIMIC-III [49] and ECtHR [6, 7] datasets. |
| Dataset Splits | Yes | Appendix E includes plots of the validation perplexity throughout training, confirming that FlashAttention is as numerically stable as the baselines and produces the same training / validation curves. |
| Hardware Specification | Yes | on one A100 GPU with 40 GB HBM |
| Software Dependencies | No | The paper mentions 'PyTorch' and states 'Our implementation uses Apex's FMHA code (https://github.com/NVIDIA/apex/tree/master/apex/contrib/csrc/fmha) as a starting point,' but it does not specify version numbers for any software dependencies like PyTorch, CUDA, or Apex itself. |
| Experiment Setup | No | The paper refers to 'Additional experiment details are in Appendix E.' and, for LRA, 'We follow the implementation and experimental setting in Tay et al. [83] and Xiong et al. [94].' However, it does not provide specific hyperparameter values or system-level training settings within the main body of the paper. |
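
The Pseudocode row above cites the paper's Algorithm 0 (Standard Attention Implementation). For reference, here is a minimal PyTorch sketch of that baseline. The tensor layout (batch, heads, seq_len, head_dim) and the 1/sqrt(d) softmax scaling are conventions assumed here for concreteness rather than details quoted from the paper; the relevant property of the baseline is that it materializes the full N x N score and softmax matrices in GPU HBM, which is the memory traffic FlashAttention's tiling avoids.

```python
import torch

def standard_attention(Q, K, V):
    """Standard (non-fused) attention in the spirit of the paper's Algorithm 0.

    Q, K, V: tensors of shape (batch, heads, seq_len, head_dim).
    Materializes the full (seq_len x seq_len) score matrix S and the
    softmax matrix P, i.e. O(N^2) extra memory.
    """
    d = Q.shape[-1]
    S = Q @ K.transpose(-2, -1) / (d ** 0.5)  # S = QK^T / sqrt(d)
    P = torch.softmax(S, dim=-1)              # row-wise softmax
    O = P @ V                                 # O = PV
    return O

# Tiny shapes so the sketch also runs on CPU.
Q = torch.randn(1, 2, 8, 4)
K = torch.randn(1, 2, 8, 4)
V = torch.randn(1, 2, 8, 4)
print(standard_attention(Q, K, V).shape)  # torch.Size([1, 2, 8, 4])
```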
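
The Research Type row notes that the paper benchmarks the runtime and memory footprint of attention implementations on a single A100 GPU. The sketch below shows one generic way to time runtime and peak memory of an attention call in PyTorch; it is not the paper's benchmark harness, and it uses torch.nn.functional.scaled_dot_product_attention (available in PyTorch 2.0+) purely as a placeholder kernel to measure.

```python
import torch
import torch.nn.functional as F

def benchmark_attention(batch=8, heads=16, seq_len=1024, head_dim=64, iters=30):
    """Rough runtime / peak-memory measurement for an attention call on one GPU.

    Generic timing sketch, not the paper's benchmark harness.
    """
    device = "cuda"
    q, k, v = (torch.randn(batch, heads, seq_len, head_dim,
                           device=device, dtype=torch.float16) for _ in range(3))

    # Warm-up so one-time kernel selection is excluded from the measurement.
    for _ in range(5):
        F.scaled_dot_product_attention(q, k, v)
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        F.scaled_dot_product_attention(q, k, v)
    end.record()
    torch.cuda.synchronize()

    ms_per_iter = start.elapsed_time(end) / iters
    peak_mib = torch.cuda.max_memory_allocated() / 2**20
    print(f"{ms_per_iter:.3f} ms/iter, peak memory {peak_mib:.1f} MiB")

if torch.cuda.is_available():
    benchmark_attention()
```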