ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification
Authors: Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, Bohan Zhuang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that ZipCache achieves superior compression ratios, fast generation speed, and minimal performance losses compared with previous KV cache compression methods. |
| Researcher Affiliation | Academia | Yefei He (1), Luoming Zhang (1), Weijia Wu (2), Jing Liu (3), Hong Zhou (1), Bohan Zhuang (1,3). (1) Zhejiang University, China; (2) National University of Singapore, Singapore; (3) ZIP Lab, Monash University, Australia |
| Pseudocode | Yes | Algorithm 1: Channel-separable Tokenwise Quantization (CSTQuant). procedure CSTQuant: Input: data X ∈ R^(l×hd), target bit-width k. for i = 0 to hd do: c_i = max(\|X_i\|); X_i = X_i / c_i // normalize each channel of X. X̂ = TokenQuant(X, k) // do tokenwise quantization. for i = 0 to hd do: X̂_i = X̂_i · c_i // rescale each channel of X̂. |
| Open Source Code | Yes | Code is available at https://github.com/ThisisBillhe/ZipCache/. |
| Open Datasets | Yes | Models and datasets. To validate the efficacy of our proposed method, we conduct experiments with three open-source LLMs: Mistral [20], LLaMA2 [37] and LLaMA3. These models are evaluated on three challenging benchmarks: GSM8k [6] for math problem solving, HumanEval [4] for code generation, and LineRetrieval [25] for data retrieval. |
| Dataset Splits | No | The paper mentions evaluating models on datasets but does not explicitly state the train/validation/test splits used for these datasets. |
| Hardware Specification | Yes | Data is collected by serving the LLaMA3-8B model on an NVIDIA A100 GPU. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as programming languages, libraries, or frameworks. |
| Experiment Setup | Yes | We employ mixed precision quantization for KV cache where salient tokens will be quantized to 4-bit while the remaining will be quantized to 2-bit. For both subsets, we apply channelwise quantization for the key cache and channel-separable tokenwise quantization for the value cache. The proportion of salient tokens will be denoted by "Saliency Ratio" in the experimental results. During the decoding process, ZipCache adopts a streaming strategy [21] and repeats the compression process for the KV cache whenever 100 new tokens are generated. |
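The CSTQuant pseudocode quoted in the table can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the inner `TokenQuant` step is assumed here to be symmetric per-token uniform quantization, which the pseudocode references but does not spell out.

```python
import numpy as np

def cst_quant(x: np.ndarray, k: int) -> np.ndarray:
    """Channel-separable tokenwise quantization (CSTQuant), a sketch.

    x: (l, hd) matrix of l tokens with hd channels.
    k: target bit-width.
    """
    # Normalize each channel by its max absolute value (c_i = max(|X_i|)).
    c = np.abs(x).max(axis=0, keepdims=True)   # (1, hd) per-channel scales
    c = np.where(c == 0, 1.0, c)               # guard against all-zero channels
    x_norm = x / c

    # TokenQuant: symmetric per-token (per-row) uniform quantization to k bits.
    # (Assumed form; the paper's exact quantizer may differ.)
    qmax = 2 ** (k - 1) - 1
    s = np.abs(x_norm).max(axis=1, keepdims=True) / qmax  # (l, 1) per-token scales
    s = np.where(s == 0, 1.0, s)
    q = np.clip(np.round(x_norm / s), -qmax - 1, qmax)
    x_hat = q * s

    # Rescale each channel back (X̂_i = X̂_i * c_i).
    return x_hat * c
```

Normalizing per channel first lets a single per-token scale cover channels of very different magnitudes, which is the point of making tokenwise quantization "channel-separable".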
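The mixed-precision scheme described in the Experiment Setup row (4-bit for salient tokens, 2-bit for the rest, controlled by a saliency ratio) can be illustrated with a small sketch. The saliency scores and the per-token quantizer below are stand-ins: ZipCache's actual saliency metric and quantizers are not reproduced in this table.

```python
import numpy as np

def quantize_tokens(x: np.ndarray, k: int) -> np.ndarray:
    """Symmetric per-token uniform quantization to k bits (illustrative helper)."""
    qmax = 2 ** (k - 1) - 1
    s = np.abs(x).max(axis=1, keepdims=True) / qmax
    s = np.where(s == 0, 1.0, s)
    return np.clip(np.round(x / s), -qmax - 1, qmax) * s

def mixed_precision_kv(cache: np.ndarray, saliency: np.ndarray,
                       ratio: float = 0.1) -> np.ndarray:
    """Quantize the top-`ratio` salient tokens to 4-bit and the rest to 2-bit.

    cache: (l, hd) KV cache slice; saliency: (l,) per-token scores
    (how ZipCache derives these scores is not shown here).
    """
    l = cache.shape[0]
    n_salient = max(1, int(round(ratio * l)))
    salient_idx = np.argsort(saliency)[-n_salient:]   # indices of salient tokens
    out = quantize_tokens(cache, 2)                    # 2-bit for all tokens...
    out[salient_idx] = quantize_tokens(cache[salient_idx], 4)  # ...4-bit for salient
    return out
```

Spending 4 bits only on the small salient subset keeps the average bit-width close to 2 while protecting the tokens that matter most for accuracy.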