Loki: Low-rank Keys for Efficient Sparse Attention

Authors: Prajwal Singhania, Siddharth Singh, Shwai He, Soheil Feizi, Abhinav Bhatele

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluations show that Loki is able to speed up the attention computation due to reduced data movement (load/store) and compute costs while maintaining the efficacy of the models better than other popular approximation methods. Evaluation of Loki on multiple LLMs and downstream tasks shows that it can achieve significant speedups with minimal degradation in model quality.
Researcher Affiliation | Academia | Prajwal Singhania, Siddharth Singh, Shwai He, Soheil Feizi, Abhinav Bhatele; Department of Computer Science, University of Maryland, College Park, MD 20742; prajwal@umd.edu, bhatele@cs.umd.edu
Pseudocode | Yes | Algorithm 1 (Loki: PCA-based Top-K Attention); an illustrative sketch follows the table.
Open Source Code | Yes | Our code is released under the MIT License. The link to the GitHub repository is provided in the paper. For the original paper submission, we provided our code in the supplemental material. https://github.com/hpcgroup/loki
Open Datasets | Yes | To investigate the dimensionality of attention keys, we run 11 transformer-based models: Llama-2 7B/13B/70B [31], Llama-3 8B/70B [8], TinyLlama-1.1B [38], Pythia-6.9B [4], Mistral-7B [15], Mixtral-8x7B/8x22B [16], and Phi-3-Mini-4K [22] on three popular English-language datasets: WikiText-2 [21] (Validation Split), C4 [25] (Custom Split), and BookCorpus [40] (Custom Split).
Dataset Splits | Yes | To investigate the dimensionality of attention keys, we run 11 transformer-based models... on three popular English-language datasets: WikiText-2 [21] (Validation Split), C4 [25] (Custom Split), and BookCorpus [40] (Custom Split). Custom splits are used for datasets where the validation split is not available.
Hardware Specification | Yes | All experiments are run on NVIDIA A100 GPUs with 40 and 80 GB of memory on the Perlmutter [24] supercomputer.
Software Dependencies | No | We implement optimized sparse matrix multiplication kernels for Loki in Triton, leading to a speedup of up to 45% over the standard Hugging Face Transformers [34] attention implementation (vanilla attention) for Llama2-13B. The paper does not provide specific version numbers for these software dependencies. An illustrative sketch of the gathered sparse products follows the table.
Experiment Setup | Yes | We compare our method against three methods: full attention without any approximations, the exact Top-K approach, which computes the exact attention scores and then uses the top-k tokens to compute the final output, and H2O [39], a popular token-eviction method. For these comparisons, we show results with a budget size of kf = 0.25 and 0.125. For our method, we additionally use df = 0.25 and 0.125. This configuration of our method represents a 2.6x theoretical speedup (see the cost-model note after the table). We use a batch size of 16 and report the average time over 10 trials (std. dev. in measured times was less than 0.05 percent of the mean). For Llama2-13B, prompt length = 3072, generation length = 512.
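
The sketch below illustrates the kind of PCA-based top-k attention that the Pseudocode row (Algorithm 1) refers to. It is a minimal, single-head PyTorch rendering under stated assumptions: the PCA basis for the keys is computed offline (e.g., on a calibration set), the function name `pca_topk_attention` and the exact ordering of steps are illustrative, and it is not the authors' released implementation.

```python
# Minimal sketch of PCA-based top-k attention (single head, no causal mask).
# Assumptions: pca_basis is a (d, d) orthonormal matrix of key PCA directions,
# computed offline and sorted by decreasing explained variance.
import torch

def pca_topk_attention(q, k, v, pca_basis, d_frac=0.25, k_frac=0.25):
    """q: (n_q, d), k/v: (n_kv, d). Returns (n_q, d) attention output."""
    d = q.shape[-1]
    d_red = max(1, int(d_frac * d))           # reduced key dimensionality (df * d)
    top_k = max(1, int(k_frac * k.shape[0]))  # number of keys kept per query (kf * n)

    # 1. Rotate queries and keys into the PCA basis and keep the leading dims.
    q_red = (q @ pca_basis)[:, :d_red]
    k_red = (k @ pca_basis)[:, :d_red]

    # 2. Approximate attention scores in the reduced space (used only for ranking).
    approx_scores = q_red @ k_red.T / d ** 0.5

    # 3. Select the top-k keys per query based on the approximate scores.
    idx = approx_scores.topk(top_k, dim=-1).indices           # (n_q, top_k)

    # 4. Attend only over the selected keys/values. One plausible reading:
    #    recompute scores for the selected keys in full dimensionality.
    k_sel, v_sel = k[idx], v[idx]                              # (n_q, top_k, d)
    scores = torch.einsum("qd,qkd->qk", q, k_sel) / d ** 0.5
    weights = torch.softmax(scores, dim=-1)
    return torch.einsum("qk,qkd->qd", weights, v_sel)
```

In this reading, the expensive full-dimensional work scales with the kf fraction of keys, while the ranking step scales with the df fraction of dimensions, which is where the reduced data movement and compute come from.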
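
The 45% speedup cited in the Software Dependencies row comes from custom Triton kernels. The PyTorch-only sketch below is not that Triton code; it shows the gather-style sparse products such kernels would fuse, where only the top-k selected keys and values are read per query, reducing loads/stores relative to dense attention. Function names and tensor layouts are assumptions for exposition.

```python
# Reference (PyTorch) version of the gathered sparse products; illustrative only.
import torch

def gather_selected(x, idx):
    """x: (b, h, n_kv, d); idx: (b, h, n_q, top_k) -> (b, h, n_q, top_k, d)."""
    b, h = idx.shape[:2]
    bi = torch.arange(b, device=x.device)[:, None, None, None]
    hi = torch.arange(h, device=x.device)[None, :, None, None]
    return x[bi, hi, idx]

def sparse_attention_products(q, k, v, idx, scale):
    """q: (b, h, n_q, d); idx holds the top-k key indices per query."""
    k_sel = gather_selected(k, idx)   # read only the selected keys
    v_sel = gather_selected(v, idx)   # read only the selected values
    scores = torch.einsum("bhqd,bhqkd->bhqk", q, k_sel) * scale
    probs = torch.softmax(scores, dim=-1)
    return torch.einsum("bhqk,bhqkd->bhqd", probs, v_sel)
```

A fused GPU kernel avoids materializing the gathered key/value copies that this reference version creates, which is the data-movement saving the row alludes to.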
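
As a sanity check on the 2.6x theoretical speedup quoted in the Experiment Setup row, and assuming it refers to the kf = df = 0.25 setting, one simple cost model (an assumption of this note, not a formula from the summary above) reproduces the figure: vanilla attention costs roughly 2nd per query (QK^T plus PV), while Loki costs roughly n·d·df for approximate scores over all keys plus 2·kf·n·d for the full-dimensional score and value products over the selected keys.

```python
# Back-of-the-envelope check under the assumed cost model described above:
#   vanilla ~ 2*n*d, Loki ~ n*d*df + 2*kf*n*d  =>  speedup ~ 2 / (df + 2*kf)
d_frac, k_frac = 0.25, 0.25
speedup = 2.0 / (d_frac + 2.0 * k_frac)
print(f"theoretical speedup ~ {speedup:.2f}x")  # ~2.67x, consistent with the quoted ~2.6x
```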