Simple linear attention language models balance the recall-throughput tradeoff

Authors: Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, James Zou, Atri Rudra, Christopher Ré

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | By applying experiments and theory to a broad set of architectures, we identify a key tradeoff between a model's state size and recall ability. We show that efficient alternatives to attention (e.g., H3, Mamba, RWKV) maintain a fixed-size recurrent state, but struggle at recall. We propose BASED, a simple architecture combining linear and sliding window attention. By varying BASED's window size and linear attention feature dimension, we can dial the state size and traverse the Pareto frontier of the recall-memory tradeoff curve, recovering the full quality of attention on one end and the small state size of attention-alternatives on the other. We train language models up to 1.3b parameters and show that BASED matches the strongest sub-quadratic models (e.g., Mamba) in perplexity and outperforms them on real-world recall-intensive tasks by 10.36 accuracy points. We further develop IO-aware algorithms that enable BASED to provide 24× higher throughput on language generation than FlashAttention-2 when generating 1024 tokens using 1.3b parameter models. (A minimal sketch of the linear plus sliding-window combination is given after the table.)
Researcher Affiliation | Academia | Stanford University; University at Buffalo.
Pseudocode | Yes | To make our attention competitive in real-world wall-clock time and memory usage, we provide hardware-efficient CUDA algorithms for generation prefill (Algorithm 1) and decoding (Algorithm 2). (A plain-PyTorch sketch of the recurrent decode step appears after the table.)
Open Source Code | Yes | Code for this work is provided at: https://github.com/HazyResearch/based.
Open Datasets | Yes | We pretrain language models from scratch at two parameter scales (360m and 1.3b parameters) on the Pile (Gao et al., 2020).
Dataset Splits | No | The paper does not explicitly provide the training/validation/test splits used for the Pile; it only mentions evaluation on the test set.
Hardware Specification | Yes | Experiments were run using an NVIDIA H100 GPU and averaged over 20 repetitions. [...] All benchmarking is on a single NVIDIA H100 GPU, using CUDA graphs during next token prediction (NVIDIA, 2019). (A rough CUDA-graph timing sketch appears after the table.)
Software Dependencies | No | The paper mentions using the 'Flash Attention code base' and 'CUDA graphs' but does not specify exact version numbers for programming languages, libraries, or other software components.
Experiment Setup | Yes | Table 7: BASED Training Settings; Table 8: Attention Training Settings; Table 9: Mamba Training Settings; Table 10: Hyena Training Settings; Table 11: Hyena Training Settings; Table 12: Hyena Training Settings; Table 13: Gated Linear Attention (GLA) Training Settings.
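
The architecture described in the abstract above pairs a linear-attention mixer, whose feature dimension sets the recurrent state size, with a small window of exact softmax attention. Below is a minimal PyTorch sketch of the two mixers, assuming a 2nd-order Taylor feature map and a 64-token causal window; the normalization constants, tensor shapes, and the way the mixers (and short convolutions) are arranged across layers are simplified assumptions, not the authors' optimized implementation.

```python
import torch

def taylor_feature_map(x):
    """2nd-order Taylor approximation of exp(q . k): phi(x) ~ [1, x, vec(x x^T)/sqrt(2)].
    The feature dimension grows as 1 + d + d^2, which sets the recurrent state size."""
    d = x.shape[-1]
    x = x / (d ** 0.25)  # simplified normalization (assumption)
    x2 = torch.einsum("...i,...j->...ij", x, x).flatten(-2) / (2 ** 0.5)
    return torch.cat([torch.ones_like(x[..., :1]), x, x2], dim=-1)

def linear_attention(q, k, v):
    """Causal linear attention over (batch, seq, dim) tensors: O(n) in sequence length."""
    q, k = taylor_feature_map(q), taylor_feature_map(k)
    kv = torch.einsum("bnf,bnd->bnfd", k, v).cumsum(dim=1)  # running sum of phi(k) outer v
    z = k.cumsum(dim=1)                                     # running sum of phi(k) (normalizer)
    num = torch.einsum("bnf,bnfd->bnd", q, kv)
    den = torch.einsum("bnf,bnf->bn", q, z).clamp(min=1e-6)
    return num / den.unsqueeze(-1)

def sliding_window_attention(q, k, v, window=64):
    """Exact softmax attention restricted to the most recent `window` tokens (causal)."""
    n = q.shape[1]
    scores = torch.einsum("bid,bjd->bij", q, k) / (q.shape[-1] ** 0.5)
    i = torch.arange(n).unsqueeze(1)  # query positions
    j = torch.arange(n).unsqueeze(0)  # key positions
    mask = (j > i) | (j <= i - window)  # drop future tokens and tokens outside the window
    scores = scores.masked_fill(mask.to(scores.device), float("-inf"))
    return torch.einsum("bij,bjd->bid", scores.softmax(dim=-1), v)
```

Shrinking `window` and the feature dimension shrinks the recurrent state (and recall capacity); growing them moves the model toward full attention. That is the dial on the recall-memory tradeoff the abstract refers to.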
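Algorithms 1 and 2 themselves are IO-aware CUDA kernels and are not reproduced here. The sketch below only illustrates, in plain PyTorch, the recurrent property the decoding algorithm exploits: causal linear attention collapses the entire key/value history into a fixed-size state that is updated once per generated token, so per-token work and memory do not grow with sequence length (unlike a softmax KV cache). Shapes and the 1e-6 clamp are illustrative assumptions.

```python
import torch

def decode_step(state, norm, q_t, k_t, v_t, feature_map):
    """One generation step of causal linear attention with a fixed-size recurrent state.

    state: (f, d) running sum of phi(k) outer v; norm: (f,) running sum of phi(k).
    """
    q_t, k_t = feature_map(q_t), feature_map(k_t)  # map the new query/key to feature space, shape (f,)
    state = state + torch.outer(k_t, v_t)          # rank-1 update of the recurrent state
    norm = norm + k_t                              # update the normalizer
    out = (q_t @ state) / (q_t @ norm).clamp(min=1e-6)
    return out, state, norm
```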
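For context on how such throughput numbers are typically collected, here is a rough timing harness using CUDA graphs: a single decode step is captured once and then replayed, which removes per-token kernel-launch overhead. This is not the paper's benchmark code; `step_fn` and `static_inputs` are hypothetical placeholders, and CUDA graph capture requires the step to read from and write into static (pre-allocated) tensors.

```python
import time
import torch

def benchmark_decode(step_fn, static_inputs, n_warmup=3, n_repeats=20):
    """Average the latency of a captured decode step over `n_repeats` graph replays."""
    # Warm up on a side stream so lazy initialization is not captured in the graph.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(n_warmup):
            step_fn(*static_inputs)
    torch.cuda.current_stream().wait_stream(s)

    # Capture one decode step into a CUDA graph, then replay it for timing.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        step_fn(*static_inputs)

    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_repeats):
        graph.replay()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_repeats
```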