MiniCache: KV Cache Compression in Depth Dimension for Large Language Models
Authors: Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, Bohan Zhuang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a comprehensive evaluation of MiniCache utilizing various models including LLaMA-2, LLaMA-3, Phi-3, Mistral, and Mixtral across multiple benchmarks, demonstrating its exceptional performance in achieving superior compression ratios and high throughput. |
| Researcher Affiliation | Academia | Akide Liu¹, Jing Liu¹, Zizheng Pan¹, Yefei He², Gholamreza Haffari¹, Bohan Zhuang¹,² (¹ZIP Lab, Monash University, Australia; ²ZIP Lab, Zhejiang University, China) |
| Pseudocode | Yes | Algorithm 1: The MiniCache Inference Algorithm; Algorithm 2: The MiniCache Prefill & Decoding Compression Algorithm |
| Open Source Code | Yes | Project is available at https://minicache.vmv.re . |
| Open Datasets | Yes | We conduct extensive experiments with representative LLMs, including Mixtral-8x7B [22], Phi-3-Mini [23], and LLaMA-3 [6] 8B and 70B. Our method is benchmarked across a diverse range of question answering and generation datasets [24, 25, 26, 27, 28, 29, 30, 31] using the lm-eval-harness [32]. Additionally, we evaluate our results on LongBench [33] for long-sequence generation. |
| Dataset Splits | Yes | Based on LLaMA-3-70B [6], we conduct zero-shot inference on the validation sets of three widely recognized benchmarks: CoQA [71], GSM8K [10], and TruthfulQA [72]. |
| Hardware Specification | Yes | For sequential loading of large models, we utilize 4 NVIDIA A100 80GB GPUs; more details are given in Appendix D. ... Using the LLaMA-2-7B model on a single 80GB NVIDIA A100 GPU, we benchmark our method in a batch-serving scenario... |
| Software Dependencies | No | Our method is benchmarked across a diverse range of question answering and generation datasets [24, 25, 26, 27, 28, 29, 30, 31] using the lm-eval-harness [32]. |
| Experiment Setup | Yes | For the proposed MiniCache, we set the interpolation parameter t to 0.6, so that the merged result has a smaller rotation angle to the next layer's states. Furthermore, we set the token retention threshold γ to 0.05, according to the statistics of unmergeable tokens across multiple datasets. |
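
The "Pseudocode" and "Experiment Setup" rows pin down the two hyperparameters that drive MiniCache's depth-wise merge: the SLERP interpolation weight t = 0.6 and the token retention threshold γ = 0.05. As a rough illustration of how these could fit together, here is a minimal PyTorch sketch of merging the KV states of two adjacent layers. The function names are ours, and treating γ as a per-token angular-distance cutoff for retention is an assumption of the sketch, not a detail confirmed by the excerpts above.

```python
import torch

def slerp(v0, v1, t, eps=1e-7):
    """Spherical linear interpolation between unit direction vectors v0 and v1."""
    dot = (v0 * v1).sum(dim=-1, keepdim=True).clamp(-1 + eps, 1 - eps)
    theta = torch.acos(dot)                    # angle between the two directions
    sin_theta = torch.sin(theta)
    w0 = torch.sin((1.0 - t) * theta) / sin_theta
    w1 = torch.sin(t * theta) / sin_theta
    return w0 * v0 + w1 * v1

def merge_adjacent_kv(kv_lo, kv_hi, t=0.6, gamma=0.05):
    """Merge per-token KV states from layers l and l+1 into one shared cache.

    kv_lo, kv_hi: (num_tokens, head_dim) tensors. Directions are merged via
    SLERP (t = 0.6 biases the result toward layer l+1, i.e. a smaller
    rotation angle to the next layer); per-layer magnitudes are kept so the
    shared directions can be rescaled back at restoration time.
    Assumption: gamma acts as an angular-distance cutoff marking tokens as
    unmergeable, so both original states are retained for those tokens.
    """
    mag_lo = kv_lo.norm(dim=-1, keepdim=True)   # per-layer magnitudes, stored separately
    mag_hi = kv_hi.norm(dim=-1, keepdim=True)
    dir_lo, dir_hi = kv_lo / mag_lo, kv_hi / mag_hi

    shared = slerp(dir_lo, dir_hi, t)           # one direction replaces two layers' caches

    # Tokens whose adjacent-layer directions disagree too much stay unmerged.
    angle = torch.acos((dir_lo * dir_hi).sum(-1).clamp(-1.0, 1.0))
    retain = angle > gamma

    return shared, (mag_lo, mag_hi), retain
```

Restoring a layer's cache then amounts to rescaling the shared direction by that layer's stored magnitude (and substituting the retained originals for unmergeable tokens), which is where the compression in the depth dimension comes from.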
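
The "Open Datasets" and "Dataset Splits" rows point to the lm-eval-harness as the evaluation driver. For readers reproducing the zero-shot numbers, a minimal invocation might look like the following, assuming a v0.4-style release of the harness; the model path is a placeholder, and the exact task configuration used in the paper is not stated in the excerpts.

```python
import lm_eval

# Zero-shot evaluation of a (placeholder) checkpoint on three of the
# benchmarks named above. Task names follow current lm-eval-harness
# conventions and may differ across harness versions.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Meta-Llama-3-70B",  # placeholder model path
    tasks=["coqa", "gsm8k", "truthfulqa_mc2"],
    num_fewshot=0,      # zero-shot, matching the Dataset Splits row
    batch_size=8,
)
print(results["results"])
```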