Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction

Authors: Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W. Lee, Sangdoo Yun, Hyun Oh Song

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive empirical evaluations demonstrate that KVzip reduces KV cache size by 3–4× and FlashAttention decoding latency by approximately 2×, with negligible performance loss in question-answering, retrieval, reasoning, and code comprehension tasks. Evaluations include various models such as LLaMA3.1, Qwen2.5, and Gemma3, with context lengths reaching up to 170K tokens.
Researcher Affiliation | Collaboration | ¹Seoul National University, ²Neural Processing Research Center, ³NAVER AI Lab
Pseudocode | Yes | Pseudo Code. Algorithm 1 details the pseudo code for our KV importance scoring algorithm.
Open Source Code | Yes | https://github.com/snu-mllab/KVzip
Open Datasets | Yes | Evaluations span diverse datasets: SQuAD [47], GSM8K [12], needle-in-a-haystack (NIAH) [26], and nine tasks from SCBench [35].
Dataset Splits | No | Evaluations span diverse datasets: SQuAD [47], GSM8K [12], needle-in-a-haystack (NIAH) [26], and nine tasks from SCBench [35]. SCBench provides comprehensive multi-query evaluations, including tasks from RULER [23] and ∞Bench [59].
Hardware Specification | Yes | Empirical evaluations on an NVIDIA A100 GPU in Figure 8 confirm approximately twice the computational overhead of standard prefill during compression, with minimal additional memory (under 2%).
Software Dependencies | No | Additionally, we propose a softmax-free variant in Appendix C.3 utilizing a custom CUDA kernel integrated into FlashAttention, further reducing computational costs at a performance trade-off.
Experiment Setup | Yes | We employ a non-uniform head-budget allocation strategy for KV eviction, retaining KV pairs with the top r% importance scores across all attention heads, where r% denotes the target compression ratio. KV pairs of the initial system prompt remain intact. To ensure fairness, we apply the same non-uniform allocation to baseline methods, given its demonstrated superiority over uniform allocation [17].
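
The eviction rule quoted in the Experiment Setup entry (keep the top r% of KV pairs ranked jointly across all attention heads, with the system prompt left intact) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_kv_to_keep`, the nested-list score layout, and the choice to apply the budget only to non-protected positions are all assumptions made for illustration, and the importance scores are taken as given rather than computed.

```python
import math


def select_kv_to_keep(scores, ratio, num_protected=0):
    """Pick KV positions to retain under a shared, non-uniform head budget.

    scores[h][t] is a precomputed importance score for the KV pair at head h,
    position t (hypothetical layout). `ratio` is the fraction of KV pairs to
    retain. The first `num_protected` positions of every head (standing in
    for the system prompt) are always kept; as an assumption of this sketch,
    the budget is counted over the remaining positions only.
    """
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")

    num_heads = len(scores)

    # Pool every evictable KV pair from every head into one candidate list.
    candidates = [
        (scores[h][t], h, t)
        for h in range(num_heads)
        for t in range(num_protected, len(scores[h]))
    ]
    budget = math.ceil(ratio * len(candidates))

    # Rank jointly across heads: highest score first, ties broken by
    # (head, position) so the result is deterministic.
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))

    # Protected positions are kept unconditionally.
    kept = [list(range(min(num_protected, len(scores[h])))) for h in range(num_heads)]
    for _, h, t in candidates[:budget]:
        kept[h].append(t)

    return [sorted(positions) for positions in kept]


# Toy example: 2 heads, 4 positions each, position 0 protected.
toy_scores = [
    [0.9, 0.1, 0.8, 0.2],   # head 0
    [0.3, 0.05, 0.4, 0.02],  # head 1
]
print(select_kv_to_keep(toy_scores, ratio=0.5, num_protected=1))
# prints [[0, 2, 3], [0, 2]]
```

In the toy run, head 0 retains three positions and head 1 retains two, which is the sense in which the allocation is non-uniform: heads with more high-scoring pairs receive a larger share of the shared budget, whereas a uniform scheme would give each head the same count.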