SparQ Attention: Bandwidth-Efficient LLM Inference
Authors: Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, Douglas Orr
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that SparQ Attention brings up to 8× savings in attention data transfers without substantial drops in accuracy, by evaluating Llama 2 and 3, Mistral, Gemma and Pythia models on a wide range of downstream tasks. |
| Researcher Affiliation | Industry | 1Graphcore Research, United Kingdom 2Synthesia, United Kingdom (work done while at Graphcore Research). |
| Pseudocode | Yes | Algorithm 1 SparQ Attention (a hedged code sketch is given after the table) |
| Open Source Code | Yes | The full code can be found in Appendix B. |
| Open Datasets | Yes | For question answering, we use the SQuAD (Rajpurkar et al., 2016) and TriviaQA (Joshi et al., 2017) datasets in the open-book setting. Summarisation is evaluated on the CNN/DailyMail dataset (See et al., 2017) using the ROUGE-L F-score (Lin, 2004) as the metric. We use the WikiText-103 dataset (Merity et al., 2016) with bits per character (BPC) for evaluating language modelling performance. We construct examples using the Tiny-Shakespeare dataset (Karpathy, 2015) |
| Dataset Splits | No | The paper mentions the total number of samples used for some tasks (e.g., 'SQuAD 1-shot (4000 samples)'), but it does not explicitly provide percentages or counts for train, validation, and test splits for the datasets used in its experiments. |
| Hardware Specification | Yes | Roofline analysis of Llama 2 7B on A100 (40GB) |
| Software Dependencies | Yes | All experiments use PyTorch 2.1.2+cu121 on Ubuntu AWS instances. |
| Experiment Setup | Yes | Table G1: Experimental setup. SparQ Attention: Rank r ∈ {8, 16, 32, 64}; Number of values k = 128; Local window l = k/4 |
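
The pseudocode row above refers to Algorithm 1 of the paper. Below is a minimal, single-query PyTorch sketch of the SparQ Attention idea, not the authors' reference implementation: it approximates the attention scores using only the r largest-magnitude query components, then runs exact attention over the top-k selected positions and blends the result with the mean value vector. The function name and simplifications (no batching, no attention heads, and the forced local window of l recent tokens is omitted) are assumptions for illustration.

```python
import torch

def sparq_attention(q, K, V, r=16, k=128):
    """Sketch of SparQ Attention for one query.

    q: (d,) query vector; K, V: (S, d) key/value caches.
    Returns an approximation of softmax(q K^T / sqrt(d)) V.
    """
    d = q.shape[-1]

    # Step 1: approximate scores using only the r most salient query components.
    idx = torch.topk(q.abs(), r).indices                        # (r,)
    # Temperature scaled by the fraction of |q| mass the chosen components cover.
    scale = torch.sqrt(d * q[idx].abs().sum() / q.abs().sum())
    s_hat = torch.softmax(q[idx] @ K[:, idx].T / scale, dim=-1)  # (S,)

    # Step 2: exact attention over the top-k positions under the approximate scores.
    top = torch.topk(s_hat, min(k, K.shape[0])).indices          # (k,)
    alpha = s_hat[top].sum()                                     # score mass covered by top-k
    s = torch.softmax(q @ K[top].T / d**0.5, dim=-1)             # (k,)
    y_top = s @ V[top]                                           # (d,)

    # Blend with the mean value vector to compensate for the discarded positions.
    return alpha * y_top + (1 - alpha) * V.mean(dim=0)
```

With r and k much smaller than the head dimension and sequence length, only r components of each key plus k full key/value rows need to be read from memory, which is the source of the bandwidth savings reported in the paper.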