FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Authors: Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, Christopher Ré
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate that FlashAttention speeds up model training and improves model quality by modeling longer context. We also benchmark the runtime and memory footprint of FlashAttention and block-sparse FlashAttention compared to prior attention implementations. (Section 4: Experiments; a runtime/memory timing sketch appears below the table) |
| Researcher Affiliation | Academia | Department of Computer Science, Stanford University Department of Computer Science and Engineering, University at Buffalo, SUNY |
| Pseudocode | Yes | Algorithm 0: Standard Attention Implementation (a minimal sketch appears below the table) |
| Open Source Code | Yes | We open-source FlashAttention to make it easier to build on this primitive. FlashAttention code is available at https://github.com/HazyResearch/flash-attention |
| Open Datasets | Yes | We train BERT-large (seq. length 512) ... GPT-2 (seq. length 1K) on the large OpenWebText dataset [34]... long-range arena (LRA [83]) benchmark... MIMIC-III [49] and ECtHR [6, 7] datasets. |
| Dataset Splits | Yes | Appendix E includes plots of the validation perplexity throughout training, confirming that FlashAttention is as numerically stable as the baselines and produces the same training / validation curves. |
| Hardware Specification | Yes | on one A100 GPU with 40 GB HBM |
| Software Dependencies | No | The paper mentions 'PyTorch' and states 'Our implementation uses Apex's FMHA code (https://github.com/NVIDIA/apex/tree/master/apex/contrib/csrc/fmha) as a starting point,' but it does not specify version numbers for any software dependencies like PyTorch, CUDA, or Apex itself. |
| Experiment Setup | No | The paper refers to 'Additional experiment details are in Appendix E.' and, for LRA, 'We follow the implementation and experimental setting in Tay et al. [83] and Xiong et al. [94].' However, it does not provide specific hyperparameter values or system-level training settings within the main body of the paper. |
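
The Pseudocode row above cites the paper's Algorithm 0 (Standard Attention Implementation). For reference, here is a minimal PyTorch sketch of that baseline. The tensor layout (batch, heads, seq_len, head_dim) and the 1/sqrt(d) softmax scaling are conventions assumed here for concreteness rather than details quoted from the paper; the relevant property of the baseline is that it materializes the full N x N score and softmax matrices in GPU HBM, which is the memory traffic FlashAttention's tiling avoids.

```python
import torch

def standard_attention(Q, K, V):
    """Standard (non-fused) attention in the spirit of the paper's Algorithm 0.

    Q, K, V: tensors of shape (batch, heads, seq_len, head_dim).
    Materializes the full (seq_len x seq_len) score matrix S and the
    softmax matrix P, i.e. O(N^2) extra memory.
    """
    d = Q.shape[-1]
    S = Q @ K.transpose(-2, -1) / (d ** 0.5)  # S = QK^T / sqrt(d)
    P = torch.softmax(S, dim=-1)              # row-wise softmax
    O = P @ V                                 # O = PV
    return O

# Tiny shapes so the sketch also runs on CPU.
Q = torch.randn(1, 2, 8, 4)
K = torch.randn(1, 2, 8, 4)
V = torch.randn(1, 2, 8, 4)
print(standard_attention(Q, K, V).shape)  # torch.Size([1, 2, 8, 4])
```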
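
The Research Type row notes that the paper benchmarks the runtime and memory footprint of attention implementations on a single A100 GPU. The sketch below shows one generic way to time runtime and peak memory of an attention call in PyTorch; it is not the paper's benchmark harness, and it uses torch.nn.functional.scaled_dot_product_attention (available in PyTorch 2.0+) purely as a placeholder kernel to measure.

```python
import torch
import torch.nn.functional as F

def benchmark_attention(batch=8, heads=16, seq_len=1024, head_dim=64, iters=30):
    """Rough runtime / peak-memory measurement for an attention call on one GPU.

    Generic timing sketch, not the paper's benchmark harness.
    """
    device = "cuda"
    q, k, v = (torch.randn(batch, heads, seq_len, head_dim,
                           device=device, dtype=torch.float16) for _ in range(3))

    # Warm-up so one-time kernel selection is excluded from the measurement.
    for _ in range(5):
        F.scaled_dot_product_attention(q, k, v)
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        F.scaled_dot_product_attention(q, k, v)
    end.record()
    torch.cuda.synchronize()

    ms_per_iter = start.elapsed_time(end) / iters
    peak_mib = torch.cuda.max_memory_allocated() / 2**20
    print(f"{ms_per_iter:.3f} ms/iter, peak memory {peak_mib:.1f} MiB")

if torch.cuda.is_available():
    benchmark_attention()
```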