SparQ Attention: Bandwidth-Efficient LLM Inference

Authors: Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, Douglas Orr

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that SparQ Attention brings up to 8× savings in attention data transfers without substantial drops in accuracy, by evaluating Llama 2 and 3, Mistral, Gemma and Pythia models on a wide range of downstream tasks.
Researcher Affiliation | Industry | Graphcore Research, United Kingdom; Synthesia, United Kingdom (work done while at Graphcore Research).
Pseudocode | Yes | Algorithm 1 SparQ Attention
Open Source Code | Yes | The full code can be found in Appendix B.
Open Datasets | Yes | For question answering, we use the SQuAD (Rajpurkar et al., 2016) and TriviaQA (Joshi et al., 2017) datasets in the open-book setting. Summarisation is evaluated on the CNN/Daily Mail dataset (See et al., 2017) using the ROUGE-L F-score (Lin, 2004) as the metric. We use the WikiText-103 dataset (Merity et al., 2016) with bits per character (BPC) for evaluating language modelling performance. We construct examples using the Tiny-Shakespeare dataset (Karpathy, 2015).
Dataset Splits | No | The paper reports the total number of samples used for some tasks (e.g., "SQuAD 1-shot (4000 samples)"), but it does not explicitly provide percentages or counts for train, validation, and test splits of the datasets used in its experiments.
Hardware Specification | Yes | Roofline analysis of Llama 2 7B on A100 (40GB)
Software Dependencies | Yes | All experiments use PyTorch 2.1.2+cu121 on Ubuntu AWS instances.
Experiment Setup | Yes | Table G1: Experimental setup. SparQ Attention: rank r ∈ {8, 16, 32, 64}; number of values k = 128; local window l = k/4
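
For reference, the setup parameters in Table G1 (rank r, number of values k, local window l = k/4) correspond to the steps of SparQ Attention described in the paper: approximate the attention scores using the r largest-magnitude query components, gather the top-k key/value positions plus a local window of l, and reweight the result with the mean value to account for the dropped positions. The PyTorch sketch below is illustrative only; the function name, the single-query decode setting, and the exact handling of the local window and softmax scaling are our assumptions, not the paper's reference implementation (which is given in its Appendix B).

```python
import torch

def sparq_attention_sketch(q, K, V, v_mean, r=32, k=128, l=32):
    """Illustrative single-query SparQ Attention step (assumed shapes, not official code).

    q:      (d,)   query for the current token
    K, V:   (S, d) key / value cache
    v_mean: (d,)   running mean of the value cache
    r, k, l: rank, number of gathered positions, local window (paper uses l = k/4)
    """
    d, S = q.shape[-1], K.shape[0]

    # Step 1: approximate scores from the r largest-|q| components,
    # so only r columns of K need to be transferred.
    i_r = torch.topk(q.abs(), r).indices
    scale = torch.sqrt(d * q[i_r].abs().sum() / q.abs().sum())
    s_hat = torch.softmax(q[i_r] @ K[:, i_r].T / scale, dim=-1)

    # Step 2: gather the top-k positions by approximate score, always keeping
    # the most recent l positions (local window).
    i_k = torch.topk(s_hat, min(k, S)).indices
    local = torch.arange(max(S - l, 0), S, device=q.device)
    i_s = torch.unique(torch.cat([i_k, local]))
    alpha = s_hat[i_s].sum()  # approximate attention mass covered by the gathered positions

    # Step 3: exact attention over the gathered subset, blended with the mean
    # value to compensate for the positions that were dropped.
    s = torch.softmax(q @ K[i_s].T / d ** 0.5, dim=-1)
    return alpha * (s @ V[i_s]) + (1.0 - alpha) * v_mean
```

Under these assumptions, a decode step reads only r columns of K for the score approximation and roughly k + l rows of K and V for the exact attention, which is the source of the bandwidth savings quoted in the Research Type row above.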