Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
Authors: Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted a comprehensive study on the element distribution in KV cache of popular LLMs. Our findings indicate that the key cache should be quantized per-channel, i.e., group elements along the channel dimension and quantize them together. In contrast, the value cache should be quantized per-token. From this analysis, we developed a tuning-free 2bit KV cache quantization algorithm, named KIVI. With the hardware-friendly implementation, KIVI can enable Llama (Llama-2), Falcon, and Mistral models to maintain almost the same quality while using 2.6× less peak memory usage (including the model weight). This reduction in memory usage enables up to 4× larger batch size, bringing 2.35× ∼ 3.47× throughput on real LLM inference workload. The source code is available at https://github.com/jy-yuan/KIVI. |
| Researcher Affiliation | Academia | ¹Department of Computer Science, Rice University; ²Department of Computer Science, Texas A&M University; ³Department of Computer Science, Stevens Institute of Technology; ⁴Department of Electrical and Computer Engineering, Carnegie Mellon University. |
| Pseudocode | Yes | Algorithm 1: The KIVI Prefill & Decoding Algorithm |
| Open Source Code | Yes | The source code is available at https://github.com/jy-yuan/KIVI. |
| Open Datasets | Yes | Specifically, we adopt generation tasks from LM-Eval (Gao et al., 2021) for normal context length evaluation and LongBench (Bai et al., 2023) for long context evaluation, respectively. |
| Dataset Splits | No | The paper does not explicitly provide specific training/test/validation dataset splits (e.g., percentages, sample counts, or citations to predefined splits) for reproducibility. |
| Hardware Specification | Yes | The hardware here is a single NVIDIA A100 GPU (80GB). |
| Software Dependencies | No | The paper mentions using the "Hugging Face Transformers codebase" and implements kernels in "Triton" and "CUDA" but does not specify exact version numbers for these software dependencies. |
| Experiment Setup | Yes | Following previous work (Sheng et al., 2023), the group size G in Algorithm 1 for quantization is set as 32 across all experiments, the residual length R for key and value cache is set to 128. |
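
The grouping scheme quoted in the table (per-channel for keys, per-token for values, group size 32, 2 bits) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, which relies on Triton and CUDA kernels; the function names, the synthetic key matrix, and the constant offset added to one channel are all assumptions made for illustration. The sketch also omits the full-precision residual of the most recent R = 128 tokens and returns dequantized floats rather than packed 2-bit integers, so that the reconstruction error can be measured directly.

```python
import numpy as np


def quantize_groups(x, bits=2):
    """Asymmetric uniform quantization along the last axis of x (hypothetical helper)."""
    levels = 2 ** bits - 1                       # 3 for 2-bit codes (values 0..3)
    zero = x.min(axis=-1, keepdims=True)         # per-group zero point
    scale = (x.max(axis=-1, keepdims=True) - zero) / levels
    scale = np.where(scale == 0, 1.0, scale)     # avoid division by zero for constant groups
    q = np.clip(np.round((x - zero) / scale), 0, levels).astype(np.uint8)
    return q, scale, zero


def dequantize_groups(q, scale, zero):
    """Map integer codes back to approximate real values."""
    return q.astype(np.float32) * scale + zero


def quantize_per_token(v, group_size=32, bits=2):
    """v has shape (tokens, channels); each group spans group_size channels of one token."""
    tokens, channels = v.shape
    assert channels % group_size == 0
    grouped = v.reshape(tokens, channels // group_size, group_size)
    q, scale, zero = quantize_groups(grouped, bits)
    return dequantize_groups(q, scale, zero).reshape(tokens, channels)


def quantize_per_channel(k, group_size=32, bits=2):
    """k has shape (tokens, channels); each group spans group_size tokens of one channel."""
    return quantize_per_token(k.T, group_size, bits).T


# Synthetic key matrix (assumption, not data from the paper): one channel is
# shifted by a constant to mimic a channel with consistently large magnitude.
rng = np.random.default_rng(0)
k = rng.normal(size=(128, 64))
k[:, 5] += 20.0

err_token = float(np.mean((k - quantize_per_token(k)) ** 2))
err_channel = float(np.mean((k - quantize_per_channel(k)) ** 2))
print("per-token MSE:", err_token)
print("per-channel MSE:", err_channel)
```

On this synthetic input, per-channel grouping keeps the shifted channel in its own groups, so its offset does not stretch the quantization range of the other channels. With per-token grouping the shifted channel shares a group with 31 ordinary channels and widens their quantization step, which increases the reconstruction error.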