KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
Authors: Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted a comprehensive study on the element distribution in KV cache of popular LLMs. Our findings indicate that the key cache should be quantized per-channel, i.e., group elements along the channel dimension and quantize them together. In contrast, the value cache should be quantized per-token. From this analysis, we developed a tuning-free 2bit KV cache quantization algorithm, named KIVI. With the hardware-friendly implementation, KIVI can enable Llama (Llama-2), Falcon, and Mistral models to maintain almost the same quality while using 2.6× less peak memory usage (including the model weight). This reduction in memory usage enables up to 4× larger batch size, bringing 2.35× ∼ 3.47× throughput on real LLM inference workload. The source code is available at https://github.com/jy-yuan/KIVI. |
| Researcher Affiliation | Academia | 1Department of Computer Science, Rice University 2Department of Computer Science, Texas A&M University 3Department of Computer Science, Stevens Institute of Technology 4Department of Electrical and Computer Engineering, Carnegie Mellon University. |
| Pseudocode | Yes | Algorithm 1: The KIVI Prefill & Decoding Algorithm |
| Open Source Code | Yes | The source code is available at https://github.com/jy-yuan/KIVI. |
| Open Datasets | Yes | Specifically, we adopt generation tasks from LM-Eval (Gao et al., 2021) for normal context length evaluation and LongBench (Bai et al., 2023) for long context evaluation, respectively. |
| Dataset Splits | No | The paper does not explicitly provide specific training/test/validation dataset splits (e.g., percentages, sample counts, or citations to predefined splits) for reproducibility. |
| Hardware Specification | Yes | The hardware here is a single NVIDIA A100 GPU (80GB). |
| Software Dependencies | No | The paper mentions using the "Hugging Face Transformers codebase" and implements kernels in "Triton" and "CUDA" but does not specify exact version numbers for these software dependencies. |
| Experiment Setup | Yes | Following previous work (Sheng et al., 2023), the group size G in Algorithm 1 for quantization is set to 32 across all experiments, and the residual length R for the key and value cache is set to 128. |
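To make the quantization scheme quoted in the Research Type row concrete, below is a minimal PyTorch sketch of asymmetric 2-bit quantization applied per-channel to the key cache and per-token to the value cache. The function names, the fake-quantization (quantize-then-dequantize) structure, and the single-head cache layout are illustrative assumptions, not the paper's fused Triton/CUDA kernels.

```python
import torch

def quantize_2bit_asym(x: torch.Tensor, group_size: int = 32):
    """Asymmetric 2-bit quantization along the last dim of `x`, in groups of
    `group_size`. Returns integer codes in {0,...,3} plus the per-group zero
    point and scale needed to dequantize."""
    *lead, n = x.shape
    assert n % group_size == 0, "last dim must be divisible by the group size"
    g = x.reshape(*lead, n // group_size, group_size)
    mn = g.min(dim=-1, keepdim=True).values              # per-group zero point
    mx = g.max(dim=-1, keepdim=True).values
    scale = (mx - mn).clamp(min=1e-8) / 3.0              # 2 bits -> 4 levels (0..3)
    codes = torch.round((g - mn) / scale).clamp(0, 3).to(torch.uint8)
    return codes, mn, scale

def dequantize_2bit_asym(codes, mn, scale, shape):
    """Reconstruct an approximate full-precision tensor of `shape`."""
    return (codes.float() * scale + mn).reshape(shape)

# Toy single-head KV cache: [num_tokens, num_channels].
T, C = 256, 128
key, value = torch.randn(T, C), torch.randn(T, C)

# Key cache, per-channel: quantization constants are shared within a channel,
# so transpose and let groups of G tokens in the same channel be quantized together.
k_codes, k_mn, k_scale = quantize_2bit_asym(key.T.contiguous())
key_hat = dequantize_2bit_asym(k_codes, k_mn, k_scale, key.T.shape).T

# Value cache, per-token: quantization constants are shared within a token,
# so groups of G channels in the same token are quantized together.
v_codes, v_mn, v_scale = quantize_2bit_asym(value)
value_hat = dequantize_2bit_asym(v_codes, v_mn, v_scale, value.shape)
```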
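Continuing the sketch above (and reusing the hypothetical `quantize_2bit_asym` / `dequantize_2bit_asym` helpers), one way to read the Experiment Setup row's group size G = 32 and residual length R = 128 is: the most recent R tokens stay in full precision and only the older part of the cache is quantized in groups of G. This is an assumed illustration of the split, not the paper's streaming prefill/decoding implementation from Algorithm 1.

```python
G, R = 32, 128  # group size and residual length from the setup row

# Keep the most recent R tokens of the key cache in full precision;
# quantize only the older tokens (per-channel, as above).
key_residual = key[-R:]                 # [R, C], kept in full precision
key_old = key[:-R]                      # [T - R, C], to be quantized
codes, mn, scale = quantize_2bit_asym(key_old.T.contiguous(), group_size=G)

# At attention time the two parts are used together: dequantize the old
# part and concatenate it with the full-precision residual.
key_old_hat = dequantize_2bit_asym(codes, mn, scale, key_old.T.shape).T
key_full = torch.cat([key_old_hat, key_residual], dim=0)   # [T, C]
```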