KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

Authors: Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conducted a comprehensive study on the element distribution in KV cache of popular LLMs. Our findings indicate that the key cache should be quantized per-channel, i.e., group elements along the channel dimension and quantize them together. In contrast, the value cache should be quantized per-token. From this analysis, we developed a tuning-free 2bit KV cache quantization algorithm, named KIVI. With the hardware-friendly implementation, KIVI can enable Llama (Llama-2), Falcon, and Mistral models to maintain almost the same quality while using 2.6× less peak memory usage (including the model weight). This reduction in memory usage enables up to 4× larger batch size, bringing 2.35× ∼ 3.47× throughput on real LLM inference workload. The source code is available at https://github.com/jy-yuan/KIVI. (See the quantization sketch after this table.)
Researcher Affiliation | Academia | (1) Department of Computer Science, Rice University; (2) Department of Computer Science, Texas A&M University; (3) Department of Computer Science, Stevens Institute of Technology; (4) Department of Electrical and Computer Engineering, Carnegie Mellon University.
Pseudocode | Yes | Algorithm 1: The KIVI Prefill & Decoding Algorithm
Open Source Code | Yes | The source code is available at https://github.com/jy-yuan/KIVI.
Open Datasets | Yes | Specifically, we adopt generation tasks from LM-Eval (Gao et al., 2021) for normal context length evaluation and LongBench (Bai et al., 2023) for long context evaluation, respectively.
Dataset Splits | No | The paper does not explicitly provide specific training/test/validation dataset splits (e.g., percentages, sample counts, or citations to predefined splits) for reproducibility.
Hardware Specification | Yes | The hardware here is a single NVIDIA A100 GPU (80GB).
Software Dependencies | No | The paper mentions using the Hugging Face Transformers codebase and implementing kernels in Triton and CUDA, but does not specify exact version numbers for these software dependencies.
Experiment Setup | Yes | Following previous work (Sheng et al., 2023), the group size G in Algorithm 1 for quantization is set to 32 across all experiments, and the residual length R for the key and value cache is set to 128. (See the decoding-buffer sketch after this table.)
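
To make the per-channel vs. per-token distinction in the Research Type row concrete, here is a minimal PyTorch sketch of asymmetric 2-bit quantization applied along the two different axes. It is an illustration only: the tensor shapes, the helper names (`quantize_asym`, `dequantize`), and the use of a single quantization group per axis are assumptions, not the KIVI repository's implementation, which uses grouped quantization (group size 32) and fused Triton/CUDA kernels.

```python
# Illustrative sketch only; shapes and helper names are assumptions,
# not the KIVI repository's actual API.
import torch

def quantize_asym(x: torch.Tensor, n_bits: int = 2, dim: int = 0):
    """Asymmetric (scale + zero-point) quantization of x along `dim`."""
    x_max = x.amax(dim=dim, keepdim=True)
    x_min = x.amin(dim=dim, keepdim=True)
    scale = ((x_max - x_min) / (2 ** n_bits - 1)).clamp(min=1e-8)
    zero_point = (-x_min / scale).round()
    q = (x / scale + zero_point).round().clamp(0, 2 ** n_bits - 1)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

# Toy single-head KV cache: (num_tokens, head_dim).
key_cache = torch.randn(64, 128)
value_cache = torch.randn(64, 128)

# Key cache, per-channel: each channel gets its own scale/zero-point,
# shared across tokens (reduce over the token axis, dim=0).
k_q, k_s, k_z = quantize_asym(key_cache, dim=0)

# Value cache, per-token: each token gets its own scale/zero-point,
# shared across channels (reduce over the channel axis, dim=-1).
v_q, v_s, v_z = quantize_asym(value_cache, dim=-1)

print("key   reconstruction error:",
      (dequantize(k_q, k_s, k_z) - key_cache).abs().mean().item())
print("value reconstruction error:",
      (dequantize(v_q, v_s, v_z) - value_cache).abs().mean().item())
```

The only difference between the two calls is where the scale and zero-point are shared: across tokens for the key cache, across channels for the value cache.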
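
The group size G = 32 and residual length R = 128 reported in the Experiment Setup row govern how newly generated tokens move from a full-precision residual buffer into the 2-bit quantized cache. The sketch below paraphrases that bookkeeping from Algorithm 1 (prefill & decoding); the class name `ToyKVCache`, its methods, and the simplified per-token grouping are illustrative assumptions rather than the paper's exact procedure.

```python
# Toy sketch of residual-buffer bookkeeping around a 2-bit quantized cache.
# Names and grouping are assumptions, not the KIVI repository's API.
import torch

GROUP_SIZE = 32      # G in Algorithm 1 (32 in all experiments)
RESIDUAL_LEN = 128   # R in Algorithm 1 (128 for both key and value cache)

class ToyKVCache:
    """Single-head cache: older tokens 2-bit quantized, recent tokens full precision."""

    def __init__(self):
        self.quantized_blocks = []   # list of (q, scale, zero_point) tuples
        self.residual = []           # most recent tokens, kept in full precision

    def append(self, kv_token: torch.Tensor):
        """Append one new token's key (or value) vector during decoding."""
        self.residual.append(kv_token)
        # Once R tokens have accumulated, quantize them as one block and
        # start a fresh full-precision residual buffer.
        if len(self.residual) >= RESIDUAL_LEN:
            block = torch.stack(self.residual)          # (R, head_dim)
            self.quantized_blocks.append(self._quantize(block))
            self.residual = []

    @staticmethod
    def _quantize(block: torch.Tensor, n_bits: int = 2):
        # Simplified per-token grouping: GROUP_SIZE channels share one
        # scale/zero-point. (The paper quantizes keys per-channel and
        # values per-token; see the previous sketch.)
        tokens, dim = block.shape
        g = block.reshape(tokens, dim // GROUP_SIZE, GROUP_SIZE)
        x_max = g.amax(dim=-1, keepdim=True)
        x_min = g.amin(dim=-1, keepdim=True)
        scale = ((x_max - x_min) / (2 ** n_bits - 1)).clamp(min=1e-8)
        zero = (-x_min / scale).round()
        q = (g / scale + zero).round().clamp(0, 2 ** n_bits - 1)
        return q, scale, zero

cache = ToyKVCache()
for _ in range(300):                 # decode 300 toy tokens with head_dim = 128
    cache.append(torch.randn(128))
print(len(cache.quantized_blocks), "quantized blocks;",
      len(cache.residual), "tokens still in full precision")
```

In this sketch the most recent tokens (fewer than R of them) stay in full precision until a complete block of R tokens can be quantized at once, while older tokens are held in 2 bits.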