SparQ Attention: Bandwidth-Efficient LLM Inference

Authors: Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, Douglas Orr

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that SparQ Attention brings up to 8× savings in attention data transfers without substantial drops in accuracy, by evaluating Llama 2 and 3, Mistral, Gemma and Pythia models on a wide range of downstream tasks.
Researcher Affiliation | Industry | Graphcore Research, United Kingdom; Synthesia, United Kingdom (work done while at Graphcore Research).
Pseudocode | Yes | Algorithm 1 SparQ Attention
Open Source Code | Yes | The full code can be found in Appendix B.
Open Datasets | Yes | For question answering, we use the SQuAD (Rajpurkar et al., 2016) and TriviaQA (Joshi et al., 2017) datasets in the open-book setting. Summarisation is evaluated on the CNN/Daily Mail dataset (See et al., 2017) using the ROUGE-L F-score (Lin, 2004) as the metric. We use the WikiText-103 dataset (Merity et al., 2016) with bits per character (BPC) for evaluating language modelling performance. We construct examples using the Tiny-Shakespeare dataset (Karpathy, 2015).
Dataset Splits | No | The paper reports the total number of samples used for some tasks (e.g., "SQuAD 1-shot (4000 samples)"), but it does not explicitly provide percentages or counts for train, validation, and test splits of the datasets used in its experiments.
Hardware Specification | Yes | Roofline analysis of Llama 2 7B on A100 (40GB)
Software Dependencies | Yes | All experiments use PyTorch 2.1.2+cu121 on Ubuntu AWS instances.
Experiment Setup | Yes | Table G1: Experimental setup. SparQ Attention: rank r ∈ {8, 16, 32, 64}; number of values k = 128; local window l = k/4
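
For reference, the setup parameters in Table G1 (rank r, number of values k, local window l = k/4) correspond to the steps of SparQ Attention described in the paper: approximate the attention scores using the r largest-magnitude query components, gather the top-k key/value positions plus a local window of l, and reweight the result with the mean value to account for the dropped positions. The PyTorch sketch below is illustrative only; the function name, the single-query decode setting, and the exact handling of the local window and softmax scaling are our assumptions, not the paper's reference implementation (which is given in its Appendix B).

```python
import torch

def sparq_attention_sketch(q, K, V, v_mean, r=32, k=128, l=32):
    """Illustrative single-query SparQ Attention step (assumed shapes, not official code).

    q:      (d,)   query for the current token
    K, V:   (S, d) key / value cache
    v_mean: (d,)   running mean of the value cache
    r, k, l: rank, number of gathered positions, local window (paper uses l = k/4)
    """
    d, S = q.shape[-1], K.shape[0]

    # Step 1: approximate scores from the r largest-|q| components,
    # so only r columns of K need to be transferred.
    i_r = torch.topk(q.abs(), r).indices
    scale = torch.sqrt(d * q[i_r].abs().sum() / q.abs().sum())
    s_hat = torch.softmax(q[i_r] @ K[:, i_r].T / scale, dim=-1)

    # Step 2: gather the top-k positions by approximate score, always keeping
    # the most recent l positions (local window).
    i_k = torch.topk(s_hat, min(k, S)).indices
    local = torch.arange(max(S - l, 0), S, device=q.device)
    i_s = torch.unique(torch.cat([i_k, local]))
    alpha = s_hat[i_s].sum()  # approximate attention mass covered by the gathered positions

    # Step 3: exact attention over the gathered subset, blended with the mean
    # value to compensate for the positions that were dropped.
    s = torch.softmax(q @ K[i_s].T / d ** 0.5, dim=-1)
    return alpha * (s @ V[i_s]) + (1.0 - alpha) * v_mean
```

Under these assumptions, a decode step reads only r columns of K for the score approximation and roughly k + l rows of K and V for the exact attention, which is the source of the bandwidth savings quoted in the Research Type row above.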