Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Fast attention mechanisms: a tale of parallelism

Authors: Jingwen Liu, Hantao Yu, Clayton Sanford, Alexandr Andoni, Daniel Hsu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically test the performance of ANNA-transformer on the Match2 and induction heads tasks. Experimental details are given in Appendix G. Since Algorithm 1 is not differentiable, we train a softmax version of attention as a surrogate, and then distill from the trained model to an ANNA-transformer (based on Algorithm 1 with angular LSH [6]). Our softmax attention normalizes all the queries and keys in Q(X) and K(X) to have unit norm, and computes softmax(β Q(X) K(X)^T) V(X) with a tunable temperature parameter β > 0.
Researcher Affiliation | Collaboration | Jingwen Liu (Columbia University, New York, NY); Hantao Yu (Columbia University, New York, NY); Clayton Sanford (Google Research, San Francisco, CA); Alexandr Andoni (Columbia University, New York, NY); Daniel Hsu (Columbia University, New York, NY)
Pseudocode | Yes | Algorithm 1: ANNA implementation with LSH family H, ℓ hash tables, and z hash functions/table. Algorithm 2: Linear memory ANNA implementation with LSH family H, ℓ hash tables, and z hash functions/table.
Open Source Code | No | We will try to clean up the code and provide open access for the camera-ready version. Our main claims only focus on theoretical properties and experiments do not serve as the main contribution.
Open Datasets | Yes | The Match2 dataset is generated the same way as in [37], with context length N = 32 and upper bound M = 37. One-layer ANNA-transformers are able to achieve zero error with ℓ = 8 hash tables and z = 1 hash function per table. See Figure 1a for the detailed performance. For induction heads, we use the dataset from [55] with number of hops k = 1, context length N = 100 and alphabet size |Σ| = 4.
Dataset Splits | Yes | In this setting, ℓ 8, z = 1 can achieve 0 error on the test set with 256 test samples.
Hardware Specification | Yes | All the experiments are launched on 2 GPUs: NVIDIA Titan RTX and NVIDIA Titan Xp.
Software Dependencies | No | The paper does not provide specific software versions for libraries or frameworks used, such as Python, PyTorch, or specific machine learning libraries.
Experiment Setup | Yes | We trained 3 models with β ∈ {0.1, 1, 10} respectively, using the Adam optimizer on cross-entropy loss with learning rate 0.01. Each model has one layer, one attention head, embedding dimension m = 64, and an MLP with width 4m and GeLU activation. The dataset size, batch size, and number of training steps are 10000, 32, and 20000, respectively.
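
The surrogate attention quoted in the Research Type row (unit-norm queries and keys, followed by softmax(β Q(X) K(X)^T) V(X)) can be illustrated with a minimal, dependency-free Python sketch. This is not the authors' code: the function names, the toy inputs, and the plain-list representation are assumptions made for illustration, and the sketch covers only the differentiable softmax surrogate, not the LSH-based Algorithm 1.

```python
import math


def unit_normalize(v):
    """Scale a vector to unit Euclidean norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_softmax_attention(queries, keys, values, beta):
    """Compute softmax(beta * Q K^T) V after normalizing each query and key to unit norm.

    queries: list of query vectors; keys and values: lists of equal length.
    Returns (outputs, weights), where weights[i][j] is the attention query i puts on key j.
    """
    q_hat = [unit_normalize(q) for q in queries]
    k_hat = [unit_normalize(k) for k in keys]
    dim = len(values[0])
    outputs, weights = [], []
    for q in q_hat:
        # After normalization each score is beta times a cosine similarity in [-1, 1],
        # so beta alone controls how sharply the softmax concentrates.
        scores = [beta * sum(a * b for a, b in zip(q, k)) for k in k_hat]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(w[j] * values[j][d] for j in range(len(values))) for d in range(dim)]
        )
    return outputs, weights


if __name__ == "__main__":
    # Hypothetical toy inputs, not taken from the paper.
    Q = [[2.0, 0.0], [0.0, 3.0]]
    K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    for beta in (0.1, 1.0, 10.0):  # the three temperatures listed under Experiment Setup
        _, W = normalized_softmax_attention(Q, K, V, beta)
        print(beta, [round(w, 3) for w in W[0]])
```

Because the scores are bounded cosine similarities, a small β yields nearly uniform attention weights while a large β concentrates weight on the best-aligned key, which is presumably why β is treated as a tunable temperature in the quoted setup.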