ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification

Authors: Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, Bohan Zhuang

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that ZipCache achieves superior compression ratios, fast generation speed and minimal performance losses compared with previous KV cache compression methods.
Researcher Affiliation | Academia | Yefei He¹, Luoming Zhang¹, Weijia Wu², Jing Liu³, Hong Zhou¹, Bohan Zhuang¹,³ (¹Zhejiang University, China; ²National University of Singapore, Singapore; ³ZIP Lab, Monash University, Australia)
Pseudocode | Yes | Algorithm 1: Channel-separable Tokenwise Quantization (CSTQuant)
procedure CSTQuant(X ∈ R^(l×hd), target bit-width k):
    for i = 0 to hd do
        c_i = max(|X_i|)
        X_i = X_i / c_i        // normalize each channel of X
    X̂ = TokenQuant(X, k)       // do tokenwise quantization
    for i = 0 to hd do
        X̂_i = X̂_i · c_i        // rescale each channel of X
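The pseudocode above can be sketched in NumPy. This is a minimal fake-quantization illustration, not the authors' implementation: `token_quant` is a hypothetical helper standing in for the paper's TokenQuant step, implemented here as standard per-token min-max quantization.

```python
import numpy as np

def token_quant(x, k):
    """Tokenwise min-max fake quantization: one (offset, scale) pair per row (token)."""
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2**k - 1)
    scale = np.where(scale == 0, 1.0, scale)  # guard against constant rows
    q = np.round((x - lo) / scale)            # integer levels in [0, 2^k - 1]
    return q * scale + lo                     # dequantize back (fake quant)

def cst_quant(x, k):
    """Channel-separable tokenwise quantization (CSTQuant), per Algorithm 1.

    x: array of shape (l, hd) -- l tokens, hd channels.
    """
    c = np.max(np.abs(x), axis=0, keepdims=True)  # per-channel scale c_i
    c = np.where(c == 0, 1.0, c)                  # guard against all-zero channels
    x_norm = x / c                                # normalize each channel of X
    x_hat = token_quant(x_norm, k)                # do tokenwise quantization
    return x_hat * c                              # rescale each channel of X

np.random.seed(0)
x = np.random.randn(8, 16).astype(np.float32)
x4 = cst_quant(x, 4)
```

Normalizing each channel first keeps outlier channels from dominating the shared per-token quantization range, which is the motivation for the channel-separable step.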
Open Source Code | Yes | Code is available at https://github.com/ThisisBillhe/ZipCache/.
Open Datasets | Yes | Models and datasets. To validate the efficacy of our proposed method, we conduct experiments with three open-source LLMs: Mistral [20], LLaMA2 [37] and LLaMA3. These models are evaluated on three challenging benchmarks: GSM8k [6] for math problem solving, HumanEval [4] for code generation, and LineRetrieval [25] for data retrieval.
Dataset Splits | No | The paper mentions evaluating models on datasets but does not explicitly state the train/validation/test splits used for these datasets.
Hardware Specification | Yes | Data is collected by serving the LLaMA3-8B model on an NVIDIA A100 GPU.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as programming languages, libraries, or frameworks.
Experiment Setup | Yes | We employ mixed precision quantization for KV cache where salient tokens will be quantized to 4-bit while the remaining will be quantized to 2-bit. For both subsets, we apply channelwise quantization for the key cache and channel-separable tokenwise quantization for the value cache. The proportion of salient tokens will be denoted by "Saliency Ratio" in the experimental results. During the decoding process, ZipCache adopts a streaming strategy [21] and repeats the compression process for the KV cache whenever 100 new tokens are generated.
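The saliency-based mixed-precision split described above can be sketched as follows. This is a simplified illustration under stated assumptions: `quantize_tokens` is a hypothetical per-token min-max fake quantizer (the paper itself uses channelwise quantization for keys and CSTQuant for values), and `saliency` is assumed to be a precomputed per-token score.

```python
import numpy as np

def quantize_tokens(x, k):
    """Per-token min-max fake quantization to k bits (rows are tokens)."""
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    scale = np.maximum((hi - lo) / (2**k - 1), 1e-8)  # avoid divide-by-zero
    return np.round((x - lo) / scale) * scale + lo

def compress_kv(cache, saliency, saliency_ratio=0.1):
    """Mixed-precision KV compression: the top `saliency_ratio` fraction of
    tokens is kept at 4-bit, the remaining tokens at 2-bit.

    cache: array of shape (num_tokens, dim); saliency: per-token scores.
    """
    n = cache.shape[0]
    n_salient = max(1, int(round(n * saliency_ratio)))
    salient = np.argsort(saliency)[-n_salient:]         # highest-scoring tokens
    out = quantize_tokens(cache, 2)                     # 2-bit for everything...
    out[salient] = quantize_tokens(cache[salient], 4)   # ...then 4-bit for salient
    return out

np.random.seed(0)
cache = np.random.randn(100, 8)
scores = np.random.rand(100)
out = compress_kv(cache, scores, saliency_ratio=0.1)
```

In a streaming setting as described in the setup, `compress_kv` would simply be re-invoked on the accumulated cache after every 100 newly generated tokens.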