Fast Attention Requires Bounded Entries

Authors: Josh Alman, Zhao Song

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We present two results, showing that there is a sharp transition at B = Θ(√(log n)). If d = O(log n) and B = o(√(log n)), there is an n^{1+o(1)}-time algorithm to approximate Att(Q, K, V) up to 1/poly(n) additive error. If d = O(log n) and B = Θ(√(log n)), then assuming the Strong Exponential Time Hypothesis from fine-grained complexity theory, it is impossible to approximate Att(Q, K, V) up to 1/poly(n) additive error in truly subquadratic time n^{2-Ω(1)}. (The exact computation being approximated is sketched after the table.)
Researcher Affiliation | Collaboration | Josh Alman (josh@cs.columbia.edu, Columbia University) and Zhao Song (zsong@adobe.com, Adobe Research).
Pseudocode | Yes | The paper presents Algorithm 1 ("Our Algorithm"). An illustrative sketch of the low-rank idea behind it appears after the table.
Open Source Code | No | The paper contains no explicit statement about releasing source code and provides no link to a code repository for the methodology described.
Open Datasets | No | This is a theoretical paper that does not involve training models on datasets.
Dataset Splits | No | This is a theoretical paper and does not involve empirical validation with dataset splits.
Hardware Specification | No | The paper is theoretical and does not describe empirical experiments, so no hardware specifications are mentioned.
Software Dependencies | No | The paper is theoretical and does not discuss specific software dependencies or versions for implementation.
Experiment Setup | No | The paper is theoretical and does not describe empirical experiments with specific setup details such as hyperparameters or training configurations.
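
For reference, the quantity whose approximability the paper studies is Att(Q, K, V) = D^{-1} A V, where A = exp(QKᵀ) is applied entrywise, D = diag(A·1_n), and B bounds the magnitude of the input entries. The following minimal sketch computes it exactly in the naive O(n²·d) time that the paper's upper bound improves on; the function name and the concrete n, d, B values are illustrative choices, not taken from the paper.

```python
import numpy as np

def attention_exact(Q, K, V):
    """Exact Att(Q, K, V) = D^{-1} A V, with A = exp(Q K^T) taken
    entrywise and D = diag(A 1_n). Materializing A costs O(n^2)
    space and O(n^2 d) time, the cost the paper's algorithm avoids."""
    A = np.exp(Q @ K.T)             # n x n attention matrix
    row_sums = A.sum(axis=1)        # A 1_n, the diagonal of D
    return (A @ V) / row_sums[:, None]

# Illustrative instance (sizes and entry bound are hypothetical choices):
rng = np.random.default_rng(0)
n, d = 256, 4                       # d = O(log n) regime
B = 0.4                             # well below the sqrt(log n) threshold scale
Q, K, V = (rng.uniform(-B, B, size=(n, d)) for _ in range(3))
out = attention_exact(Q, K, V)
print(out.shape)                    # (n, d)
```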
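The paper's Algorithm 1 itself is not reproduced in this summary, but its n^{1+o(1)} upper bound rests on the polynomial method: when B = o(√(log n)), the entrywise exponential can be replaced by a low-degree polynomial, which approximately factors A into a product of two thin matrices, so A·V and A·1_n can each be computed in time linear in n. The sketch below illustrates this idea with a plain degree-g Taylor truncation and explicit tensor-product features; the truncation degree, the helper names, and the feature construction are assumptions for illustration, whereas the paper uses a more carefully chosen polynomial and error analysis.

```python
import numpy as np
from itertools import product
from math import factorial

def taylor_features(X, g):
    """Feature map phi with <phi(q), phi(k)> = sum_{t=0..g} <q, k>^t / t!,
    the degree-g Taylor truncation of exp(<q, k>). The feature dimension
    is sum_t d^t, so this is only cheap when d and g are small."""
    n, d = X.shape
    blocks = [np.ones((n, 1))]                    # t = 0 term
    for t in range(1, g + 1):
        cols = [np.prod(X[:, list(idx)], axis=1)  # monomial x_{i1} * ... * x_{it}
                for idx in product(range(d), repeat=t)]
        blocks.append(np.stack(cols, axis=1) / np.sqrt(factorial(t)))
    return np.concatenate(blocks, axis=1)

def attention_lowrank(Q, K, V, g=4):
    """Approximate Att(Q, K, V) without ever forming the n x n matrix:
    with U = phi(Q) and W = phi(K), A ~= U W^T, so A V ~= U (W^T V)
    and A 1_n ~= U (W^T 1_n), each computable in time linear in n."""
    U, W = taylor_features(Q, g), taylor_features(K, g)
    row_sums = U @ W.sum(axis=0)                  # ~= A 1_n (positive when
                                                  # the input entries are small)
    return (U @ (W.T @ V)) / row_sums[:, None]

# With the small-entry Q, K, V from the previous sketch, the gap
# np.abs(attention_lowrank(Q, K, V, g) - attention_exact(Q, K, V)).max()
# shrinks rapidly as the truncation degree g grows.
```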