SnapKV: LLM Knows What You are Looking for Before Generation
Authors: Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, Deming Chen
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate SnapKV across diverse LLMs and long-sequence datasets. SnapKV shows comparable accuracy with full KV caching method while achieving improved decoding speed and memory efficiency. Meanwhile, we conduct the pressure test with Needle-in-a-Haystack to further demonstrate its memory efficiency and information retrieval ability. ... In our experimental setup, we explore the performance of SnapKV across models that can handle extended prompt sequence contexts. First, we deliver a pressure test and benchmark the speed of LWM-Text-Chat-1M [17], which is state-of-the-art regarding its context length. We then conduct an ablation study on Mistral-7B-Instruct-v0.2 to understand the influence of pooling on the model's information retrieval performance. We assess model performances using the LongBench [18] dataset. |
| Researcher Affiliation | Collaboration | Yuhong Li (1), Yingbing Huang (1), Bowen Yang (2), Bharat Venkitesh (2), Acyr Locatelli (2), Hanchen Ye (1), Tianle Cai (3), Patrick Lewis (2), Deming Chen (1) — (1) University of Illinois Urbana-Champaign, (2) Cohere, (3) Princeton University |
| Pseudocode | Yes | Listing 1: Implementation of SnapKV in pseudo-PyTorch style. |
| Open Source Code | Yes | Our code is available at https://github.com/FasterDecoding/SnapKV. |
| Open Datasets | Yes | Our analysis utilizes samples from UltraChat [12], a multi-turn, high-quality instruction dataset consisting of 1.4 million dialogues. ... We analyze its hit rate on multiple long documents QA datasets including QMSum [13], a query-based multi-domain meeting summarization; OpenReview [14], a collection of papers from openreview.net; SPACE [15], an extractive opinion summarization in quantized transformer spaces. ... We assess model performances using the LongBench [18] dataset. ... The experiments utilized a subset of the QASPER [31] ... |
| Dataset Splits | No | The paper refers to 'train', 'test', and 'evaluation' of models but does not explicitly mention a 'validation' dataset or its specific split details, e.g., for hyperparameter tuning or early stopping. The term 'validation' is never used in the context of dataset splits for their experiments. |
| Hardware Specification | Yes | Specifically, SnapKV can process up to 380K context tokens on a single A100-80GB GPU... All experiments are conducted on an A100 80GB GPU. ... We evaluate the prefilling time and memory usage on Mistral-7B-Instruct-v0.2 with input sequence lengths ranging from 5k to 45k in Fig. 10. The results show no overhead in either aspect. SnapKV only introduces extra top-k and pooling operations which are trivial regarding computation complexity compared with original prefilling calculations. Figure 10: The prefilling time and maximum memory allocated comparison between Mistral-7B-Instruct-v0.2 with and without SnapKV on an H100. |
| Software Dependencies | No | The paper mentions 'PyTorch-style pseudo code' and 'Hugging Face implementation' but does not specify version numbers for these or any other software dependencies needed to replicate the experiments. |
| Experiment Setup | Yes | We configured the prompt KV cache size to 1024, enabling SnapKV to select the most crucial 1024 attention features from the prompt for answer generation, with a maximum pooling kernel size of 5 and an observation window size of 16, both of which are hyperparameters that can be customized. ... We set the maximum KV cache size as 2048 for SnapKV, and fix the generation length at 512 to ensure a fair comparison. ... We apply max pooling with a kernel size of 5 and use the observation window with a size of 16... For each model, we test SnapKV with various settings: compressing KV caches in the prompt to 1024, 2048, and 4096 tokens. We use max pooling with kernel size 7 and observation window size 32. ... In SnapKV, we introduce two key hyperparameters: observation window size and pooling kernel size. |
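To make the two hyperparameters concrete, here is a minimal NumPy sketch of the selection step the table describes: the last `window_size` prompt queries "vote" on earlier KV positions via their attention weights, a 1-D max pool (the pooling kernel) clusters those votes so contiguous spans survive, and a top-k keeps the cache within the configured budget. This is an illustrative reconstruction, not the authors' released PyTorch code; the function name `snapkv_select` and its exact interface are assumptions.

```python
import numpy as np

def snapkv_select(attn_weights, window_size=16, kernel_size=5, capacity=1024):
    """Sketch of SnapKV-style KV selection (hypothetical helper).

    attn_weights: softmaxed attention of shape [num_heads, q_len, kv_len]
    from the prompt's self-attention (q_len == kv_len during prefill).
    Returns per-head indices of KV entries to retain, shape [num_heads, k].
    """
    prefix_len = attn_weights.shape[-1] - window_size
    # Observation window: attention the last `window_size` queries pay
    # to each earlier prompt position, summed into per-position "votes".
    votes = attn_weights[:, -window_size:, :prefix_len].sum(axis=1)
    # 1-D max pooling over positions clusters neighbors, so a highly
    # attended token pulls its surrounding span into the cache too.
    pad = kernel_size // 2
    padded = np.pad(votes, ((0, 0), (pad, pad)), mode="edge")
    pooled = np.stack(
        [padded[:, i:i + prefix_len] for i in range(kernel_size)]
    ).max(axis=0)
    # Keep the top positions up to the budget, then always retain the
    # observation window itself (its KV is needed for generation).
    k = min(max(capacity - window_size, 0), prefix_len)
    top = np.sort(np.argsort(pooled, axis=-1)[:, -k:], axis=-1)
    window_idx = np.broadcast_to(
        np.arange(prefix_len, prefix_len + window_size),
        (votes.shape[0], window_size),
    )
    return np.concatenate([top, window_idx], axis=-1)

# Toy usage: 2 heads, 64-token prompt, compress the cache to 32 entries.
rng = np.random.default_rng(0)
attn = rng.random((2, 64, 64))
keep = snapkv_select(attn, window_size=16, kernel_size=5, capacity=32)
print(keep.shape)  # (2, 32): 16 pooled top-k positions + 16 window positions
```

With `capacity=1024`, `kernel_size=5`, and `window_size=16` this mirrors the paper's default setting; the selected indices would then gather the corresponding key/value tensors before decoding begins.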