Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

Authors: Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu

ICML 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conducted a comprehensive study on the element distribution in the KV cache of popular LLMs. Our findings indicate that the key cache should be quantized per-channel, i.e., group elements along the channel dimension and quantize them together. In contrast, the value cache should be quantized per-token. From this analysis, we developed a tuning-free 2bit KV cache quantization algorithm, named KIVI. With the hardware-friendly implementation, KIVI can enable Llama (Llama-2), Falcon, and Mistral models to maintain almost the same quality while using 2.6× less peak memory (including the model weight). This reduction in memory usage enables up to 4× larger batch size, bringing 2.35× ∼ 3.47× throughput on real LLM inference workloads. The source code is available at https://github.com/jy-yuan/KIVI.
Researcher Affiliation | Academia | (1) Department of Computer Science, Rice University; (2) Department of Computer Science, Texas A&M University; (3) Department of Computer Science, Stevens Institute of Technology; (4) Department of Electrical and Computer Engineering, Carnegie Mellon University.
Pseudocode | Yes | Algorithm 1: The KIVI Prefill & Decoding Algorithm
Open Source Code | Yes | The source code is available at https://github.com/jy-yuan/KIVI.
Open Datasets | Yes | Specifically, we adopt generation tasks from LM-Eval (Gao et al., 2021) for normal context length evaluation and LongBench (Bai et al., 2023) for long context evaluation, respectively.
Dataset Splits | No | The paper does not explicitly provide specific training/test/validation dataset splits (e.g., percentages, sample counts, or citations to predefined splits) for reproducibility.
Hardware Specification | Yes | The hardware here is a single NVIDIA A100 GPU (80GB).
Software Dependencies | No | The paper mentions using the "Hugging Face Transformers codebase" and implements kernels in "Triton" and "CUDA", but does not specify exact version numbers for these software dependencies.
Experiment Setup | Yes | Following previous work (Sheng et al., 2023), the group size G in Algorithm 1 for quantization is set to 32 across all experiments, and the residual length R for the key and value cache is set to 128.
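As a reading aid for the quantization directions described in the abstract (per-channel for the key cache, per-token for the value cache), here is a minimal NumPy sketch of asymmetric 2-bit quantization. The function names and toy shapes are illustrative only; this is not the paper's fused Triton/CUDA implementation.

```python
import numpy as np

def asym_quant_2bit(x, axis):
    """Asymmetric 2-bit quantization: map values to integers in [0, 3]
    using a zero-point (min) and scale computed along `axis`."""
    xmin = x.min(axis=axis, keepdims=True)
    xmax = x.max(axis=axis, keepdims=True)
    scale = (xmax - xmin) / 3.0                # 2 bits -> 4 levels: 0..3
    scale = np.where(scale == 0, 1.0, scale)   # guard against constant groups
    q = np.clip(np.round((x - xmin) / scale), 0, 3).astype(np.uint8)
    return q, scale, xmin

def dequant(q, scale, zero):
    """Reconstruct approximate float values from 2-bit codes."""
    return q.astype(np.float32) * scale + zero

# Toy KV cache slices: (num_tokens, hidden_dim)
rng = np.random.default_rng(0)
K = rng.normal(size=(8, 16)).astype(np.float32)
V = rng.normal(size=(8, 16)).astype(np.float32)

# Key cache: per-channel, i.e. one (scale, zero) per channel,
# computed across the token axis (axis=0).
qK, sK, zK = asym_quant_2bit(K, axis=0)

# Value cache: per-token, i.e. one (scale, zero) per token,
# computed across the channel axis (axis=1).
qV, sV, zV = asym_quant_2bit(V, axis=1)
```

Because the scheme is asymmetric (separate min and max per group), rounding error is bounded by half a quantization step per element, which is why the choice of grouping axis, matched to where outliers concentrate, matters so much at 2 bits.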
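The reported hyperparameters (group size G = 32, residual length R = 128) imply a split of the token dimension into an older, quantized part and a newest full-precision residual. The chunking rule below (tokens are quantized only once a full R-token chunk accumulates, so the quantized length is always a multiple of R and hence of G) is one plausible bookkeeping scheme for illustration, not necessarily the exact logic in the KIVI repository.

```python
import numpy as np

G, R = 32, 128  # group size and residual length from the paper's setup

def split_cache(cache):
    """Split a (num_tokens, hidden_dim) cache along the token axis:
    older tokens (length a multiple of R, hence of G since R % G == 0)
    would be stored 2-bit quantized; the newest partial chunk stays FP16."""
    n = cache.shape[0]
    n_res = n % R                      # newest tokens not yet a full chunk
    return cache[: n - n_res], cache[n - n_res:]

cache = np.zeros((300, 64), dtype=np.float32)
quant_part, residual = split_cache(cache)
print(quant_part.shape[0], residual.shape[0])  # 256 44
```

Keeping the most recent tokens in full precision is what makes the scheme hardware-friendly: quantization always operates on complete, aligned groups, and the freshest keys/values, which matter most for the next decoding step, are untouched.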