Simple linear attention language models balance the recall-throughput tradeoff
Authors: Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, James Zou, Atri Rudra, Christopher Ré
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | By applying experiments and theory to a broad set of architectures, we identify a key tradeoff between a model's state size and recall ability. We show that efficient alternatives to attention (e.g. H3, Mamba, RWKV) maintain a fixed-size recurrent state, but struggle at recall. We propose BASED, a simple architecture combining linear and sliding window attention. By varying BASED window size and linear attention feature dimension, we can dial the state size and traverse the Pareto frontier of the recall-memory tradeoff curve, recovering the full quality of attention on one end and the small state size of attention-alternatives on the other. We train language models up to 1.3b parameters and show that BASED matches the strongest sub-quadratic models (e.g. Mamba) in perplexity and outperforms them on real-world recall-intensive tasks by 10.36 accuracy points. We further develop IO-aware algorithms that enable BASED to provide 24× higher throughput on language generation than FlashAttention-2, when generating 1024 tokens using 1.3b parameter models. (A hedged sketch of the linear + sliding-window combination appears after the table.) |
| Researcher Affiliation | Academia | 1Stanford University, 2University at Buffalo. |
| Pseudocode | Yes | To make our attention competitive in real-world wall-clock time and memory usage, we provide hardware-efficient CUDA algorithms for generation prefill (Algorithm 1) and decoding (Algorithm 2). |
| Open Source Code | Yes | Code for this work is provided at: https://github.com/HazyResearch/based. |
| Open Datasets | Yes | We pretrain language models from scratch at two parameter scales (360m and 1.3b parameters) on the Pile (Gao et al., 2020). |
| Dataset Splits | No | The paper does not explicitly provide the training/validation/test dataset splits used for the Pile dataset within the text, only mentioning evaluation on the test set. |
| Hardware Specification | Yes | Experiments were run using an H100 NVIDIA GPU and averaged over 20 repetitions. [...] All benchmarking is done on a single NVIDIA H100 GPU, using CUDA cache graphs during next token prediction (NVIDIA, 2019). |
| Software Dependencies | No | The paper mentions using the 'Flash Attention code base' and 'CUDA cache graphs' but does not specify exact version numbers for programming languages, libraries, or other software components. |
| Experiment Setup | Yes | Table 7. BASED Training Settings, Table 8. Attention Training Settings, Table 9. Mamba Training Settings, Table 10. Hyena Training Settings, Table 11. Hyena Training Settings, Table 12. Hyena Training Settings, Table 13. Gated Linear Attention (GLA) Training Settings |
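The abstract and pseudocode rows above describe the core mechanism: a linear-attention half whose recurrent state size is set by the feature dimension, plus an exact sliding-window softmax half whose state is set by the window size. As a reading aid, here is a minimal PyTorch sketch of that hybrid and of the constant-memory decode update. It is not the authors' implementation: their Algorithms 1 and 2 are IO-aware CUDA kernels and their feature map is a Taylor approximation of softmax, whereas the `elu + 1` map, the function names, and the sizes below are illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation) of a BASED-style hybrid layer:
# causal linear attention with a small feature dimension plus sliding-window
# softmax attention. The paper's Taylor feature map and IO-aware CUDA kernels
# are omitted; all names and sizes here are illustrative assumptions.

import torch
import torch.nn.functional as F


def feature_map(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for the paper's Taylor-expansion feature map: any positive
    # feature map keeps the attention linear in sequence length.
    return F.elu(x) + 1


def linear_attention_prefill(q, k, v):
    # q, k: (batch, seq, feat_dim); v: (batch, seq, head_dim)
    q, k = feature_map(q), feature_map(k)
    # Causal linear attention via cumulative sums of the (feat_dim x head_dim) state.
    kv = torch.einsum("bsf,bsd->bsfd", k, v).cumsum(dim=1)  # running sum of k^T v
    z = k.cumsum(dim=1)                                     # running normalizer
    num = torch.einsum("bsf,bsfd->bsd", q, kv)
    den = torch.einsum("bsf,bsf->bs", q, z).clamp(min=1e-6)
    return num / den.unsqueeze(-1)


def sliding_window_attention(q, k, v, window: int):
    # Exact causal softmax attention restricted to the last `window` tokens.
    b, s, d = q.shape
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    idx = torch.arange(s)
    mask = (idx[None, :] > idx[:, None]) | (idx[:, None] - idx[None, :] >= window)
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v


def decode_step(state_kv, state_z, q_t, k_t, v_t):
    # Constant-memory recurrent update at generation time: the linear-attention
    # "KV cache" is a fixed-size (feat_dim x head_dim) matrix plus a feat_dim
    # normalizer, regardless of how many tokens have been seen.
    q_t, k_t = feature_map(q_t), feature_map(k_t)
    state_kv = state_kv + torch.einsum("bf,bd->bfd", k_t, v_t)
    state_z = state_z + k_t
    num = torch.einsum("bf,bfd->bd", q_t, state_kv)
    den = torch.einsum("bf,bf->b", q_t, state_z).clamp(min=1e-6)
    return num / den.unsqueeze(-1), state_kv, state_z


if __name__ == "__main__":
    b, s, feat_dim, head_dim, window = 2, 128, 16, 64, 64
    q = torch.randn(b, s, feat_dim)
    k = torch.randn(b, s, feat_dim)
    v = torch.randn(b, s, head_dim)
    y_linear = linear_attention_prefill(q, k, v)
    y_window = sliding_window_attention(
        torch.randn(b, s, head_dim), torch.randn(b, s, head_dim), v, window
    )
    print(y_linear.shape, y_window.shape)  # both (2, 128, 64)
```

In this sketch, varying `feat_dim` and `window` changes only the size of `state_kv` and of the window's KV cache, which is the state-size dial the abstract refers to when describing how BASED traverses the recall-memory tradeoff curve.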