SnapKV: LLM Knows What You are Looking for Before Generation
Authors: Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, Deming Chen
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate SnapKV across diverse LLMs and long-sequence datasets. SnapKV shows comparable accuracy with full KV caching method while achieving improved decoding speed and memory efficiency. Meanwhile, we conduct the pressure test with Needle-in-a-Haystack to further demonstrate its memory efficiency and information retrieval ability. ... In our experimental setup, we explore the performance of SnapKV across models that can handle extended prompt sequence contexts. First, we deliver a pressure test and benchmark the speed of LWM-Text-Chat-1M [17], which is state-of-the-art regarding its context length. We then conduct an ablation study on Mistral-7B-Instruct-v0.2 to understand the influence of pooling on the model's information retrieval performance. We assess model performances using the LongBench [18] dataset. |
| Researcher Affiliation | Collaboration | Yuhong Li (1), Yingbing Huang (1), Bowen Yang (2), Bharat Venkitesh (2), Acyr Locatelli (2), Hanchen Ye (1), Tianle Cai (3), Patrick Lewis (2), Deming Chen (1) — (1) University of Illinois Urbana-Champaign, (2) Cohere, (3) Princeton University |
| Pseudocode | Yes | Listing 1: Implementation of SnapKV in pseudo-PyTorch style. |
| Open Source Code | Yes | Our code is available at https://github.com/FasterDecoding/SnapKV. |
| Open Datasets | Yes | Our analysis utilizes samples from UltraChat [12], a multi-turn, high-quality instruction dataset consisting of 1.4 million dialogues. ... We analyze its hit rate on multiple long documents QA datasets including QMSum [13], a query-based multi-domain meeting summarization; OpenReview [14], a collection of papers from openreview.net; SPACE [15], an extractive opinion summarization in quantized transformer spaces. ... We assess model performances using the LongBench [18] dataset. ... The experiments utilized a subset of the QASPER [31] ... |
| Dataset Splits | No | The paper refers to 'train', 'test', and 'evaluation' of models but does not explicitly mention a 'validation' dataset or its specific split details, e.g., for hyperparameter tuning or early stopping. The term 'validation' is never used in the context of dataset splits for their experiments. |
| Hardware Specification | Yes | Specifically, SnapKV can process up to 380K context tokens on a single A100-80GB GPU... All experiments are conducted on an A100 80GB GPU. ... We evaluate the prefilling time and memory usage on Mistral-7B-Instruct-v0.2 with input sequence lengths ranging from 5k to 45k in Fig. 10. The results show no overhead in either aspect. SnapKV only introduces extra top-k and pooling operations which are trivial regarding computation complexity compared with original prefilling calculations. Figure 10: The prefilling time and maximum memory allocated comparison between Mistral-7B-Instruct-v0.2 with and without SnapKV on an H100. |
| Software Dependencies | No | The paper mentions 'PyTorch-style pseudo code' and 'Hugging Face implementation' but does not specify version numbers for these or any other software dependencies needed to replicate the experiments. |
| Experiment Setup | Yes | We configured the prompt KV cache size to 1024, enabling SnapKV to select the most crucial 1024 attention features from the prompt for answer generation, with a maximum pooling kernel size of 5 and an observation window size of 16, both of which are hyperparameters that can be customized. ... We set the maximum KV cache size as 2048 for SnapKV, and fix the generation length at 512 to ensure a fair comparison. ... We apply max pooling with a kernel size of 5 and use the observation window with a size of 16... For each model, we test SnapKV with various settings: compressing KV caches in the prompt to 1024, 2048, and 4096 tokens. We use max pooling with kernel size 7 and observation window size 32. ... In SnapKV, we introduce two key hyperparameters: observation window size and pooling kernel size. |
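To make the two hyperparameters concrete, here is a minimal NumPy sketch of the selection step the table describes: the last `window_size` prompt queries "vote" on earlier KV positions via their attention weights, a 1-D max pool (the pooling kernel) clusters those votes so contiguous spans survive, and a top-k keeps the cache within the configured budget. This is an illustrative reconstruction, not the authors' released PyTorch code; the function name `snapkv_select` and its exact interface are assumptions.

```python
import numpy as np

def snapkv_select(attn_weights, window_size=16, kernel_size=5, capacity=1024):
    """Sketch of SnapKV-style KV selection (hypothetical helper).

    attn_weights: softmaxed attention of shape [num_heads, q_len, kv_len]
    from the prompt's self-attention (q_len == kv_len during prefill).
    Returns per-head indices of KV entries to retain, shape [num_heads, k].
    """
    prefix_len = attn_weights.shape[-1] - window_size
    # Observation window: attention the last `window_size` queries pay
    # to each earlier prompt position, summed into per-position "votes".
    votes = attn_weights[:, -window_size:, :prefix_len].sum(axis=1)
    # 1-D max pooling over positions clusters neighbors, so a highly
    # attended token pulls its surrounding span into the cache too.
    pad = kernel_size // 2
    padded = np.pad(votes, ((0, 0), (pad, pad)), mode="edge")
    pooled = np.stack(
        [padded[:, i:i + prefix_len] for i in range(kernel_size)]
    ).max(axis=0)
    # Keep the top positions up to the budget, then always retain the
    # observation window itself (its KV is needed for generation).
    k = min(max(capacity - window_size, 0), prefix_len)
    top = np.sort(np.argsort(pooled, axis=-1)[:, -k:], axis=-1)
    window_idx = np.broadcast_to(
        np.arange(prefix_len, prefix_len + window_size),
        (votes.shape[0], window_size),
    )
    return np.concatenate([top, window_idx], axis=-1)

# Toy usage: 2 heads, 64-token prompt, compress the cache to 32 entries.
rng = np.random.default_rng(0)
attn = rng.random((2, 64, 64))
keep = snapkv_select(attn, window_size=16, kernel_size=5, capacity=32)
print(keep.shape)  # (2, 32): 16 pooled top-k positions + 16 window positions
```

With `capacity=1024`, `kernel_size=5`, and `window_size=16` this mirrors the paper's default setting; the selected indices would then gather the corresponding key/value tensors before decoding begins.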