HyperAttention: Long-context Attention in Near-Linear Time

Authors: Insu Han, Rajesh Jayaram, Amin Karbasi, Vahab Mirrokni, David Woodruff, Amir Zandieh

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, employing Locality Sensitive Hashing (LSH) to identify large entries, HyperAttention outperforms existing methods, giving significant speed improvements compared to state-of-the-art solutions like FlashAttention. We validate the empirical performance of HyperAttention on a variety of different long-context length datasets.
Researcher Affiliation | Collaboration | Insu Han (Yale University, insu.han@yale.edu); Rajesh Jayaram (Google Research, rkjayaram@google.com); Amin Karbasi (Yale University, Google Research, amin.karbasi@yale.edu); Vahab Mirrokni (Google Research, mirrokni@google.com); David P. Woodruff (CMU, Google Research, dwoodruf@cs.cmu.edu); Amir Zandieh (Independent Researcher, amir.zed510@gmail.com)
Pseudocode | Yes | Algorithm 1: sortLSH, locating large entries of A; Algorithm 2: ApproxD, estimating the diagonal matrix D; Algorithm 3: HyperAttention, attention mechanism in near-linear time; Algorithm 4: CausalApproxD, recursive approximation of D under causal masking. (A hedged sketch of the sortLSH bucketing idea appears after the table.)
Open Source Code | No | The paper does not include an unambiguous statement about releasing code or a direct link to a code repository.
Open Datasets | Yes | We use LongBench (Bai et al., 2023), a collection of long-context benchmark datasets, which contains 6 different tasks ranging from single and multiple-document question answering, summarization, few-shot learning, synthetic tasks, and code completion. (A hedged loading example follows the table.)
Dataset Splits | No | The paper mentions evaluating perplexity and trimming/padding the input context, but it does not explicitly describe train/validation/test splits with percentages, sample counts, or citations to predefined splits. It uses benchmark datasets whose splits are often standard, yet does not state them within the paper.
Hardware Specification | Yes | All experiments are performed on a single A100 GPU with 40 GB memory and we use FlashAttention-2 (Dao, 2023) for the exact attention computation.
Software Dependencies | No | The paper mentions FlashAttention-2 (Dao, 2023) but does not provide version numbers for other key software components, libraries, or programming languages used (e.g., Python, PyTorch, CUDA).
Experiment Setup | Yes | All inputs Q, K, V have the same length; their dimension is fixed to d = 64 and the number of attention heads is set to 12. We chose the same parameters in HyperAttention as described in the previous section. In Fig. 4, we observe that HyperAttention runs up to 54× faster without causal masking and 5.4× faster when causal masking is applied. ... We set both the bucket size b and the number of sampled columns m to 256 for all sequence lengths. (A minimal timing harness reflecting this setup follows below.)
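
The core of Algorithm 1 (sortLSH) is to hash queries and keys with a shared Hamming-style LSH, sort them by hash code, and attend only within aligned blocks, so that the large entries of the attention matrix A are likely to be captured. The sketch below illustrates that bucketing idea only, under simplified assumptions: the hash, block handling, and per-block normalization are placeholders, not the authors' implementation (which additionally estimates the row sums D via Algorithm 2 and samples extra columns).

```python
# Minimal sortLSH-style bucketing sketch (illustrative, not the paper's code).
import torch

def simhash(x: torch.Tensor, n_bits: int = 8, seed: int = 0) -> torch.Tensor:
    """Angular (SimHash) codes: sign pattern of x against random hyperplanes."""
    g = torch.Generator().manual_seed(seed)
    planes = torch.randn(x.shape[-1], n_bits, generator=g)
    bits = (x @ planes > 0).long()               # (n, n_bits), entries in {0, 1}
    weights = 2 ** torch.arange(n_bits)          # pack the bit pattern into one integer code
    return (bits * weights).sum(dim=-1)          # (n,)

def sorted_lsh_blocks(q, k, v, block_size=256):
    """Sort Q and K by hash code and attend only within aligned blocks.

    Rows and columns that collide under the LSH tend to have large inner
    products, so after sorting they land in the same block. The softmax here
    is normalized only within each block; the paper corrects this using the
    estimated row sums D from Algorithm 2.
    """
    n, d = q.shape
    q_order = torch.argsort(simhash(q))
    k_order = torch.argsort(simhash(k))
    out = torch.zeros_like(v)
    for start in range(0, n, block_size):
        qi = q_order[start:start + block_size]
        ki = k_order[start:start + block_size]
        scores = q[qi] @ k[ki].T / d ** 0.5      # scaled dot products within the block
        out[qi] = torch.softmax(scores, dim=-1) @ v[ki]
    return out

q = k = v = torch.randn(1024, 64)
approx_out = sorted_lsh_blocks(q, k, v)
```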
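For context, LongBench is distributed through the Hugging Face hub, so a single task can be pulled in a few lines. The repository id, configuration name, and record fields below are assumptions for illustration and are not stated in the paper.

```python
from datasets import load_dataset

# LongBench (Bai et al., 2023) on the Hugging Face hub; "narrativeqa" is one of
# its single-document QA tasks. Depending on the `datasets` version, loading a
# script-based dataset like this may require trust_remote_code=True.
task = load_dataset("THUDM/LongBench", "narrativeqa", split="test")

for example in task.select(range(3)):
    # Each record pairs a long context with a question; exact field names
    # can differ between LongBench tasks.
    print(len(example["context"]), example["input"][:80])
```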
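The reported setup (head dimension d = 64, 12 attention heads, bucket size b = 256, m = 256 sampled columns, a single A100 with 40 GB) can be mirrored in a small timing harness like the one below. It only shows how one might measure the quoted speedups against an exact-attention baseline; the HyperAttention callable it would compare against is a placeholder, not the authors' benchmark code.

```python
# Hypothetical timing harness mirroring the reported experimental parameters.
import time
import torch

config = dict(
    head_dim=64,       # d = 64
    n_heads=12,        # number of attention heads
    bucket_size=256,   # b = 256
    sample_size=256,   # m = 256 sampled columns
)

def benchmark(attn_fn, seq_len, n_warmup=3, n_iters=10, device="cuda"):
    """Average wall-clock time (seconds) of one forward pass at seq_len."""
    shape = (1, config["n_heads"], seq_len, config["head_dim"])
    q, k, v = (torch.randn(shape, device=device, dtype=torch.float16) for _ in range(3))
    for _ in range(n_warmup):
        attn_fn(q, k, v)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iters):
        attn_fn(q, k, v)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_iters

# Exact attention as a baseline; swap in a HyperAttention implementation
# (e.g. one taking bucket_size / sample_size) to compare speedups.
exact = lambda q, k, v: torch.nn.functional.scaled_dot_product_attention(q, k, v)
for n in (4096, 16384, 65536):
    print(n, f"{benchmark(exact, n) * 1e3:.1f} ms")
```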