KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
Authors: Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted a comprehensive study on the element distribution in KV cache of popular LLMs. Our findings indicate that the key cache should be quantized per-channel, i.e., group elements along the channel dimension and quantize them together. In contrast, the value cache should be quantized per-token. From this analysis, we developed a tuning-free 2bit KV cache quantization algorithm, named KIVI. With the hardware-friendly implementation, KIVI can enable Llama (Llama-2), Falcon, and Mistral models to maintain almost the same quality while using 2.6× less peak memory usage (including the model weight). This reduction in memory usage enables up to 4× larger batch size, bringing 2.35× ∼ 3.47× throughput on real LLM inference workload. The source code is available at https://github.com/jy-yuan/KIVI. |
| Researcher Affiliation | Academia | 1Department of Computer Science, Rice University 2Department of Computer Science, Texas A&M University 3Department of Computer Science, Stevens Institute of Technology 4Department of Electrical and Computer Engineering, Carnegie Mellon University. |
| Pseudocode | Yes | Algorithm 1: The KIVI Prefill & Decoding Algorithm |
| Open Source Code | Yes | The source code is available at https://github.com/jy-yuan/KIVI. |
| Open Datasets | Yes | Specifically, we adopt generation tasks from LM-Eval (Gao et al., 2021) for normal context length evaluation and LongBench (Bai et al., 2023) for long context evaluation, respectively. |
| Dataset Splits | No | The paper does not explicitly provide specific training/test/validation dataset splits (e.g., percentages, sample counts, or citations to predefined splits) for reproducibility. |
| Hardware Specification | Yes | The hardware here is a single NVIDIA A100 GPU (80GB). |
| Software Dependencies | No | The paper mentions using the "Hugging Face Transformers codebase" and implements kernels in "Triton" and "CUDA" but does not specify exact version numbers for these software dependencies. |
| Experiment Setup | Yes | Following previous work (Sheng et al., 2023), the group size G in Algorithm 1 for quantization is set to 32 across all experiments, and the residual length R for the key and value cache is set to 128. |
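To make the quantization scheme quoted in the Research Type row concrete, below is a minimal PyTorch sketch of asymmetric 2-bit quantization applied per-channel to the key cache and per-token to the value cache. The function names, the fake-quantization (quantize-then-dequantize) structure, and the single-head cache layout are illustrative assumptions, not the paper's fused Triton/CUDA kernels.

```python
import torch

def quantize_2bit_asym(x: torch.Tensor, group_size: int = 32):
    """Asymmetric 2-bit quantization along the last dim of `x`, in groups of
    `group_size`. Returns integer codes in {0,...,3} plus the per-group zero
    point and scale needed to dequantize."""
    *lead, n = x.shape
    assert n % group_size == 0, "last dim must be divisible by the group size"
    g = x.reshape(*lead, n // group_size, group_size)
    mn = g.min(dim=-1, keepdim=True).values              # per-group zero point
    mx = g.max(dim=-1, keepdim=True).values
    scale = (mx - mn).clamp(min=1e-8) / 3.0              # 2 bits -> 4 levels (0..3)
    codes = torch.round((g - mn) / scale).clamp(0, 3).to(torch.uint8)
    return codes, mn, scale

def dequantize_2bit_asym(codes, mn, scale, shape):
    """Reconstruct an approximate full-precision tensor of `shape`."""
    return (codes.float() * scale + mn).reshape(shape)

# Toy single-head KV cache: [num_tokens, num_channels].
T, C = 256, 128
key, value = torch.randn(T, C), torch.randn(T, C)

# Key cache, per-channel: quantization constants are shared within a channel,
# so transpose and let groups of G tokens in the same channel be quantized together.
k_codes, k_mn, k_scale = quantize_2bit_asym(key.T.contiguous())
key_hat = dequantize_2bit_asym(k_codes, k_mn, k_scale, key.T.shape).T

# Value cache, per-token: quantization constants are shared within a token,
# so groups of G channels in the same token are quantized together.
v_codes, v_mn, v_scale = quantize_2bit_asym(value)
value_hat = dequantize_2bit_asym(v_codes, v_mn, v_scale, value.shape)
```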
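Continuing the sketch above (and reusing the hypothetical `quantize_2bit_asym` / `dequantize_2bit_asym` helpers), one way to read the Experiment Setup row's group size G = 32 and residual length R = 128 is: the most recent R tokens stay in full precision and only the older part of the cache is quantized in groups of G. This is an assumed illustration of the split, not the paper's streaming prefill/decoding implementation from Algorithm 1.

```python
G, R = 32, 128  # group size and residual length from the setup row

# Keep the most recent R tokens of the key cache in full precision;
# quantize only the older tokens (per-channel, as above).
key_residual = key[-R:]                 # [R, C], kept in full precision
key_old = key[:-R]                      # [T - R, C], to be quantized
codes, mn, scale = quantize_2bit_asym(key_old.T.contiguous(), group_size=G)

# At attention time the two parts are used together: dequantize the old
# part and concatenate it with the full-precision residual.
key_old_hat = dequantize_2bit_asym(codes, mn, scale, key_old.T.shape).T
key_full = torch.cat([key_old_hat, key_residual], dim=0)   # [T, C]
```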