Loki: Low-rank Keys for Efficient Sparse Attention
Authors: Prajwal Singhania, Siddharth Singh, Shwai He, Soheil Feizi, Abhinav Bhatele
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluations show that Loki is able to speed up the attention computation due to reduced data movement (load/store) and compute costs while maintaining the efficacy of the models better than other popular approximation methods. We evaluate Loki on multiple LLMs and downstream tasks, showing that it can achieve significant speedups with minimal degradation in model quality. |
| Researcher Affiliation | Academia | Prajwal Singhania, Siddharth Singh, Shwai He, Soheil Feizi, Abhinav Bhatele Department of Computer Science, University of Maryland College Park, MD 20742 prajwal@umd.edu, bhatele@cs.umd.edu |
| Pseudocode | Yes | Algorithm 1 Loki: PCA-based Top-K Attention (an illustrative sketch of this procedure follows the table) |
| Open Source Code | Yes | Our code is released under the MIT License. The link to the GitHub repository is provided in the paper; for the original paper submission, we provided our code in the supplemental material. https://github.com/hpcgroup/loki |
| Open Datasets | Yes | To investigate the dimensionality of attention keys, we run 11 transformer-based models: Llama-2 7B/13B/70B [31], Llama-3 8B/70B [8], TinyLlama-1.1B [38], Pythia-6.9B [4], Mistral-7B [15], Mixtral-8x7B/8x22B [16], and Phi3-Mini-4K [22] on three popular English language datasets: WikiText-2 [21] (Validation Split), C4 [25] (Custom Split), and BookCorpus [40] (Custom Split). |
| Dataset Splits | Yes | To investigate the dimensionality of attention keys, we run 11 transformer-based models... on three popular English language datasets: WikiText-2 [21] (Validation Split), C4 [25] (Custom Split), and BookCorpus [40] (Custom Split). Custom splits are used for datasets where the validation split is not available. |
| Hardware Specification | Yes | All experiments are run on NVIDIA A100 GPUs with 40 and 80 GB of memory on the Perlmutter [24] supercomputer. |
| Software Dependencies | No | Thus, we implement optimized sparse matrix multiplication kernels for Loki in Triton, leading to a speedup of up to 45% over the standard Hugging Face Transformers [34] attention implementation (vanilla attention) for Llama2-13B. The paper does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | We compare our method against three methods: full attention without any approximations, the exact Top-K approach which computes the exact attention scores and then uses the top-k tokens to compute the final output, and H2O [39], a popular token-eviction method. For these comparisons, we show the results with a budget size of kf = 0.25 and 0.125. For our method, we additionally use df = 0.25 and 0.125. This configuration of ours represents a 2.6x theoretical speedup (see the cost-model note after this table). We use a batch size of 16 and report the average time over 10 trials (std. dev. in measured times was less than 0.05 percent of the mean). For Llama2-13B, prompt length = 3072, generation length = 512. |
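For context on the Pseudocode row: the idea behind Loki's PCA-based Top-K attention is to score keys in a low-rank principal-component subspace computed offline, keep only the highest-scoring tokens per query, and run exact attention over that reduced set. The snippet below is a minimal sketch of that idea in PyTorch, assuming a precomputed PCA basis; it is an illustration, not the authors' Algorithm 1 or their Triton kernels, and the function name, argument names, and the stand-in basis in the usage example are ours.

```python
# Minimal sketch of PCA-based top-k attention, assuming a rotation matrix
# `pc` (d x d, columns ordered by explained variance) was computed offline
# from calibration keys. The dense masking trick is for clarity only; the
# paper's implementation gathers the selected keys/values and uses custom
# Triton kernels to realize the compute savings.
import torch

def loki_style_attention(q, k, v, pc, d_frac=0.25, k_frac=0.25):
    """q: (n_q, d); k, v: (n_kv, d); pc: (d, d) PCA basis of the key space."""
    n_kv, d = k.shape
    d_red = max(1, int(d_frac * d))       # reduced key dimensionality (df * d)
    budget = max(1, int(k_frac * n_kv))   # number of tokens kept (kf * n)

    # 1. Approximate attention scores in the low-rank key subspace.
    q_red = q @ pc[:, :d_red]                  # (n_q, d_red)
    k_red = k @ pc[:, :d_red]                  # (n_kv, d_red)
    approx = (q_red @ k_red.T) / d ** 0.5      # (n_q, n_kv)

    # 2. Keep the top-k keys per query; mask out everything else.
    idx = approx.topk(budget, dim=-1).indices
    mask = torch.full_like(approx, float("-inf"))
    mask.scatter_(-1, idx, 0.0)

    # 3. Exact attention restricted to the selected tokens.
    scores = (q @ k.T) / d ** 0.5 + mask
    return torch.softmax(scores, dim=-1) @ v

# Example usage with random tensors and a stand-in orthonormal basis.
q = torch.randn(4, 128)
k = torch.randn(1024, 128)
v = torch.randn(1024, 128)
pc = torch.linalg.qr(torch.randn(128, 128)).Q
out = loki_style_attention(q, k, v, pc)
```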
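The 2.6x theoretical speedup quoted in the Experiment Setup row is consistent with a simple cost model (our assumption, not a formula quoted from the paper): per query, vanilla attention costs roughly 2nd operations (QK^T plus PV), while Loki costs roughly n*d*df for low-rank scoring plus 2*(kf*n)*d for exact attention over the selected budget, giving a ratio of about 2/(df + 2*kf).

```python
# Back-of-the-envelope check of the quoted "2.6x theoretical speedup",
# under the assumed cost model: vanilla ~ 2*n*d, Loki ~ n*d*df + 2*(kf*n)*d.
df, kf = 0.25, 0.25
speedup = 2.0 / (df + 2.0 * kf)
print(f"theoretical speedup ~ {speedup:.2f}x")  # ~2.67x, close to the reported 2.6x
```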