Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction

Authors: Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W. Lee, Sangdoo Yun, Hyun Oh Song

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive empirical evaluations demonstrate that KVzip reduces KV cache size by 3–4× and FlashAttention decoding latency by approximately 2×, with negligible performance loss in question-answering, retrieval, reasoning, and code comprehension tasks. Evaluations include various models such as LLaMA3.1, Qwen2.5, and Gemma3, with context lengths reaching up to 170K tokens.
Researcher Affiliation	Collaboration	¹Seoul National University, ²Neural Processing Research Center, ³NAVER AI Lab
Pseudocode	Yes	Pseudo Code. Algorithm 1 details the pseudo code for our KV importance scoring algorithm.
Open Source Code	Yes	https://github.com/snu-mllab/KVzip
Open Datasets	Yes	Evaluations span diverse datasets: SQuAD [47], GSM8K [12], needle-in-a-haystack (NIAH) [26], and nine tasks from SCBench [35].
Dataset Splits	No	Evaluations span diverse datasets: SQuAD [47], GSM8K [12], needle-in-a-haystack (NIAH) [26], and nine tasks from SCBench [35]. SCBench provides comprehensive multi-query evaluations, including tasks from RULER [23] and ∞Bench [59].
Hardware Specification	Yes	Empirical evaluations on an NVIDIA A100 GPU in Figure 8 confirm approximately twice the computational overhead of standard prefill during compression, with minimal additional memory (under 2%).
Software Dependencies	No	Additionally, we propose a softmax-free variant in Appendix C.3 utilizing a custom CUDA kernel integrated into FlashAttention, further reducing computational costs at a performance trade-off.
Experiment Setup	Yes	We employ a non-uniform head-budget allocation strategy for KV eviction, retaining KV pairs with the top r% importance scores across all attention heads, where r% denotes the target compression ratio. KV pairs of the initial system prompt remain intact. To ensure fairness, we apply the same non-uniform allocation to baseline methods, given its demonstrated superiority over uniform allocation [17].
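
The eviction rule quoted in the Experiment Setup row can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `select_kv_to_keep`, the list-of-lists score layout, and the choice to apply the budget only to non-system positions are assumptions made for illustration. The sketch ranks all KV pairs from all heads in one global list, so heads holding more high-scoring pairs end up with a larger share of the budget (non-uniform allocation), and the leading system-prompt positions are always kept.

```python
import math


def select_kv_to_keep(scores, ratio, num_system_tokens=0):
    """Return, per head, the sorted positions of the KV pairs to retain.

    scores: scores[h][t] is the importance of position t in head h
            (hypothetical layout; the paper computes these via Algorithm 1).
    ratio: fraction of evictable KV pairs to keep, in (0, 1].
    num_system_tokens: leading positions that are never evicted.
    """
    # Gather every evictable (score, head, position) triple across all heads.
    candidates = [
        (score, head, pos)
        for head, head_scores in enumerate(scores)
        for pos, score in enumerate(head_scores)
        if pos >= num_system_tokens
    ]

    # Assumption: the budget counts only evictable pairs and is rounded up.
    budget = math.ceil(ratio * len(candidates))

    # One global ranking instead of a per-head ranking: this is what makes
    # the per-head budgets non-uniform.
    candidates.sort(key=lambda c: c[0], reverse=True)

    # System-prompt positions are kept in every head regardless of score.
    kept = [list(range(min(num_system_tokens, len(h)))) for h in scores]
    for _, head, pos in candidates[:budget]:
        kept[head].append(pos)
    return [sorted(positions) for positions in kept]


# Two heads, four positions each; position 0 is the system prompt.
example_scores = [
    [0.9, 0.8, 0.7, 0.6],
    [0.9, 0.1, 0.2, 0.3],
]
example_kept = select_kv_to_keep(example_scores, ratio=0.5, num_system_tokens=1)
print(example_kept)
```

In this toy input, all three budgeted slots go to the first head because its scores dominate the global ranking, while the second head retains only its system-prompt position. A uniform allocation would instead have forced each head to keep the same number of pairs.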