KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
Authors: Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Sophia Shao, Kurt Keutzer, Amir Gholami
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | By applying our method to the LLaMA, Llama-2, Llama-3, and Mistral models, we achieve < 0.1 perplexity degradation with 3-bit quantization on both Wikitext-2 and C4, outperforming existing approaches. Our method enables serving LLaMA-7B with a context length of up to 1 million on a single A100-80GB GPU and up to 10 million on an 8-GPU system. We develop custom CUDA kernels for KVQuant, showing that we can achieve up to ~1.7× speedups, compared to baseline fp16 matrix-vector multiplications, for the LLaMA-7B model. |
| Researcher Affiliation | Academia | ¹University of California, Berkeley, ²ICSI, ³LBNL. {chooper, sehoonkim, hiva, mahoneymw, ysshao, keutzer, amirgh}@berkeley.edu |
| Pseudocode | No | The paper describes its methods textually and refers to kernel implementations, but it does not include any figures, blocks, or sections explicitly labeled 'Pseudocode' or 'Algorithm', nor does it present structured code-like steps for a procedure. |
| Open Source Code | Yes | Code is available at https://github.com/SqueezeAILab/KVQuant. |
| Open Datasets | Yes | We used the LLaMA-7B/13B/30B/65B, Llama-2-7B/13B/70B, Llama-3-8B/70B, and Mistral-7B models to evaluate our methodology by measuring perplexity on both Wikitext-2 and C4 [36, 37, 1, 16, 27, 31]. All the KVQuant models throughout this experiment section are calibrated using 16 calibration samples of sequence length 2K from the Wikitext-2 training set. |
| Dataset Splits | No | The paper mentions calibrating models using a training set and evaluating perplexity on datasets like Wikitext-2 and C4, but it does not explicitly specify train/test/validation splits or their percentages/counts for the main evaluation datasets. |
| Hardware Specification | Yes | Our method enables serving LLaMA-7B with a context length of up to 1 million on a single A100-80GB GPU and up to 10 million on an 8-GPU system (a rough memory estimate is sketched after this table). We report latency benchmarked on an A6000 GPU. The runtime is reported on a system with an A6000 GPU and an Intel Xeon Gold 6126 CPU. |
| Software Dependencies | No | The paper mentions using the 'Transformers library for LLaMA [40]' but does not provide specific version numbers for this or any other software dependencies, which are required for a reproducible description. |
| Experiment Setup | Yes | All the KVQuant models throughout this experiment section are calibrated using 16 calibration samples of sequence length 2K from the Wikitext-2 training set. We measured perplexity on both Wikitext-2 and on C4 using a sequence length equal to the maximum context length of the model (2K for LLaMA, 4K for Llama-2, and 8K for Llama-3 and Mistral-7B). We use an fp16 zero-point rather than a low-precision zero-point that is rounded to the nearest integer value. We set the number of threads such that there were 10 nonzero values assigned to each thread. (See the evaluation and zero-point sketches after this table.) |
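The memory claims in the Hardware Specification row can be sanity-checked with a back-of-the-envelope KV cache size estimate. The sketch below is illustrative only: it assumes the standard LLaMA-7B configuration (32 layers, 32 KV heads, head dimension 128) and treats quantization as a pure bit-width scaling, ignoring the scales, zero-points, and sparse outlier storage that the paper's method also keeps.

```python
# Rough KV cache memory estimate (illustrative only; ignores quantization
# metadata such as scales, zero-points, and sparse outlier storage).
def kv_cache_gib(seq_len, n_layers=32, n_kv_heads=32, head_dim=128, bits=16):
    """GiB needed to store keys + values for one sequence."""
    elems = 2 * n_layers * n_kv_heads * head_dim * seq_len  # 2 = keys and values
    return elems * bits / 8 / 2**30

for bits in (16, 4, 3, 2):
    print(f"LLaMA-7B, 1M tokens, {bits}-bit KV cache: "
          f"{kv_cache_gib(1_000_000, bits=bits):.1f} GiB")
```

Under these assumptions, the fp16 cache for one million tokens is on the order of 500 GiB, while 2–3 bit storage drops it to tens of GiB, which is consistent with the paper's claim that low-bit KV cache quantization is what makes million-token contexts feasible on a single A100-80GB GPU.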
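The Experiment Setup row describes calibration and evaluation only in prose. The following sketch shows how such a setup is commonly realized with Hugging Face `datasets` and `transformers`; the sample count (16), calibration sequence length (2K), and maximum-context evaluation length follow the quoted text, while the checkpoint name and chunking logic are assumptions, not the authors' code.

```python
# Illustrative calibration sampling and perplexity evaluation
# (assumed implementation following the quoted setup, not the authors' code).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# 16 calibration samples of length 2K drawn from the Wikitext-2 training split.
train = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
train_ids = tokenizer("\n\n".join(train["text"]), return_tensors="pt").input_ids[0]
calib_samples = [train_ids[i * 2048:(i + 1) * 2048] for i in range(16)]

# Perplexity on the Wikitext-2 test split, evaluated at the model's maximum
# context length (4K for Llama-2, per the quoted setup).
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
test_ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids[0]
seq_len, nlls = 4096, []
for i in range(len(test_ids) // seq_len):
    chunk = test_ids[i * seq_len:(i + 1) * seq_len].unsqueeze(0).to(model.device)
    with torch.no_grad():
        nlls.append(model(chunk, labels=chunk).loss * seq_len)
perplexity = torch.exp(torch.stack(nlls).sum() / (len(nlls) * seq_len))
print(f"Wikitext-2 perplexity: {perplexity.item():.3f}")
```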
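The same row also notes that the zero-point is kept in fp16 rather than rounded to a low-precision integer value. As a purely generic illustration of what that choice means for asymmetric uniform quantization (KVQuant itself uses a non-uniform datatype, which this snippet does not reproduce), the function name and rounding scheme below are assumptions:

```python
# Generic asymmetric quantization illustration (not KVQuant's non-uniform
# datatype): compare keeping the zero-point exact (fp16-style) with rounding
# it onto the integer quantization grid before dequantization.
import torch

def fake_quantize(x, bits=3, round_zeropoint=False):
    qmax = 2**bits - 1
    scale = (x.max() - x.min()) / qmax
    zeropoint = x.min()  # exact (fp16-style) zero-point
    if round_zeropoint:
        zeropoint = torch.round(zeropoint / scale) * scale  # snapped to integer grid
    q = torch.clamp(torch.round((x - zeropoint) / scale), 0, qmax)
    return q * scale + zeropoint  # dequantized values

x = torch.randn(1024)
err_exact = (x - fake_quantize(x)).abs().mean()
err_rounded = (x - fake_quantize(x, round_zeropoint=True)).abs().mean()
print(f"mean |error| with exact zero-point:   {err_exact:.4f}")
print(f"mean |error| with rounded zero-point: {err_rounded:.4f}")
```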