Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Fast attention mechanisms: a tale of parallelism
Authors: Jingwen Liu, Hantao Yu, Clayton Sanford, Alexandr Andoni, Daniel Hsu
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically test the performance of ANNA-transformer on the Match2 and induction heads tasks. Experimental details are given in Appendix G. Since Algorithm 1 is not differentiable, we train a softmax version of attention as a surrogate, and then distill from the trained model to an ANNA-transformer (based on Algorithm 1 with angular LSH [6]). Our softmax attention normalizes all the queries and keys in Q(X) and K(X) to have unit norm, and computes softmax(β Q(X)K(X)ᵀ)V(X) with a tunable temperature parameter β > 0. |
| Researcher Affiliation | Collaboration | Jingwen Liu Columbia University New York, NY EMAIL Hantao Yu Columbia University New York, NY EMAIL Clayton Sanford Google Research San Francisco, CA EMAIL Alexandr Andoni Columbia University New York, NY EMAIL Daniel Hsu Columbia University New York, NY EMAIL |
| Pseudocode | Yes | Algorithm 1: ANNA implementation with LSH family H, ℓ hash tables, and z hash functions/table. Algorithm 2: Linear memory ANNA implementation with LSH family H, ℓ hash tables, and z hash functions/table. |
| Open Source Code | No | We will try to clean up the code and provide open access for the camera-ready version. Our main claims only focus on theoretical properties and experiments do not serve as the main contribution. |
| Open Datasets | Yes | The Match2 dataset is generated the same way as [37] with context length N = 32 and upper bound M = 37. One-layer ANNA-transformers are able to achieve zero error with ℓ = 8 hash tables and z = 1 hash function per table. See Figure 1a for the detailed performance. For induction heads, we use the dataset from [55] with number of hops k = 1, context length N = 100 and alphabet size \|Σ\| = 4. |
| Dataset Splits | Yes | In this setting, ℓ 8, z = 1 can achieve 0 error on the test set with 256 test samples. |
| Hardware Specification | Yes | All the experiments are launched on 2 GPUs: NVIDIA Titan RTX and NVIDIA Titan Xp. |
| Software Dependencies | No | The paper does not provide specific software versions for libraries or frameworks used, such as Python, PyTorch, or specific machine learning libraries. |
| Experiment Setup | Yes | We trained 3 models with β ∈ {0.1, 1, 10} respectively, with Adam optimizer on cross-entropy loss and learning rate 0.01. Each model has one layer, one attention head, embedding dimension m = 64 and an MLP with width 4m and GeLU activation. The dataset size, batch size, training steps are 10000, 32, 20000 respectively. |
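
The Research Type row quotes a surrogate attention that rescales every query and key to unit norm and then computes softmax(β Q(X)K(X)ᵀ)V(X). The following is a minimal pure-Python sketch of that formula only, not the authors' code (which the table reports as unreleased). The function names `unit` and `normalized_softmax_attention` are hypothetical, and `Q`, `K`, `V` are assumed to be already-projected matrices given as lists of rows.

```python
import math


def unit(v):
    """Rescale a vector to unit Euclidean norm (assumes v is nonzero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def normalized_softmax_attention(Q, K, V, beta=1.0):
    """Return softmax(beta * Qn Kn^T) V, where Qn and Kn have unit-norm rows.

    Q: list of query rows, K: list of key rows, V: list of value rows.
    K and V must have the same number of rows.
    """
    Qn = [unit(q) for q in Q]
    Kn = [unit(k) for k in K]
    out = []
    for q in Qn:
        # Cosine similarities scaled by the temperature; each lies in [-beta, beta].
        scores = [beta * sum(a * b for a, b in zip(q, k)) for k in Kn]
        # Subtract the row maximum before exponentiating for numerical stability.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Weighted average of the value rows.
        out.append([
            sum(w * v[j] for w, v in zip(weights, V)) / total
            for j in range(len(V[0]))
        ])
    return out
```

Because the normalized scores are bounded by β in absolute value, β alone controls how peaked the attention weights are: β = 0 gives a uniform average of the value rows, while a large β concentrates each query's weight on the keys with the highest cosine similarity.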