Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference
Authors: Xiang Liu, Zhenheng Tang, Peijie Dong, Zeyu Li, Yue Liu, Bo Li, Xuming Hu, Xiaowen Chu
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive evaluations on challenging benchmarks: LongBench, Needle-In-A-HayStack, GSM8K, and JailbreakV demonstrate that ChunkKV outperforms state-of-the-art methods by up to 8.7% in precision while maintaining the same compression ratio. These results confirm that semantic-aware compression significantly enhances both efficiency and performance for long-context LLM inference, providing a simple yet effective solution to the memory bottleneck problem. |
| Researcher Affiliation | Collaboration | Xiang LIU, Zhenheng TANG, Peijie DONG, Zeyu LI, Yue LIU, Bo LI, Xuming HU, Xiaowen CHU. The Hong Kong University of Science and Technology (Guangzhou); CSE, The Hong Kong University of Science and Technology; Guangzhou HKUST Fok Ying Tung Research Institute; Terminus Technologies. |
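| Pseudocode | Yes | Algorithm 1 ChunkKV. Input: $Q \in \mathbb{R}^{T_q \times d}$, $K \in \mathbb{R}^{T_k \times d}$, $V \in \mathbb{R}^{T_v \times d}$, observe window size $w$, chunk size $c$, compressed KV cache max length $L_{\max}$. Output: compressed KV cache $K'$, $V'$. Observe Window Calculation: $A \leftarrow Q_{T_q-w:T_q} K^{\top}$ (attention scores for the observe window); $C \leftarrow \lceil T_k / c \rceil$ (calculate the number of chunks). Chunk Attention Score Calculation: for $i = 1$ to $C$ do $A_i \leftarrow \sum_{j=(i-1)c+1}^{ic} A_{:,j}$ (sum of attention scores for each chunk) end for. Top-K Chunk Selection: $k \leftarrow \lfloor L_{\max} / c \rfloor$; Top_K_Indices $\leftarrow$ indices of Top-$k$ chunks based on $A_i$. Compression: $K', V' \leftarrow$ index_select($K$, $V$, Top_K_Indices). Concatenation: $K' \leftarrow$ concat($K'_{0:L_{\max}-w}$, $K_{T_k-w:T_k}$); $V' \leftarrow$ concat($V'_{0:L_{\max}-w}$, $V_{T_v-w:T_v}$); $K'$, $V'$. Algorithm 2 Layer-wise Index Reuse for ChunkKV. Input: number of layers in LLMs $N_{\text{layers}}$, number of reuse layers $N_{\text{reuse}}$. Initialize: dictionary to store indices $I_{\text{reuse}} = \{\}$. for $l = 0$ to $(N_{\text{layers}} - 1)$ do: if $l \bmod N_{\text{reuse}} == 0$ then $K_l, V_l, I_l \leftarrow$ ChunkKV($K_l$, $V_l$); $I_{\text{reuse}}[l] \leftarrow I_l$; else $I_l \leftarrow I_{\text{reuse}}[\lfloor l / N_{\text{reuse}} \rfloor \times N_{\text{reuse}}]$; end if; $K'_l \leftarrow$ index_select($K_l$, $I_l$); $V'_l \leftarrow$ index_select($V_l$, $I_l$); end for. |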
| Open Source Code | No | The code is available at link. |
| Open Datasets | Yes | Comprehensive evaluations on challenging benchmarks: LongBench, Needle-In-A-HayStack, GSM8K, and JailbreakV demonstrate that ChunkKV outperforms state-of-the-art methods by up to 8.7% in precision while maintaining the same compression ratio. Appendix J: Licenses: For the evaluation dataset, all the datasets, including GSM8K [27] and LongBench [25], are released under MIT license. NIAH [26] is released under GPL-3.0 license. |
| Dataset Splits | Yes | Following Agarwal et al. [42], we consider many-shot GSM8K as a long-context reasoning scenario, which is a more challenging task than long-context retrieval benchmark LongBench [25]. The CoT prompt settings for this experiment are the same as those used by Wei et al. [37], for many-shot GSM8K we set the number of shots to 50, where the prompt length is more than 4k tokens. For more details on the prompt settings, please refer to the Appendix G. Table 28: Dataset Statistics. # TRAIN and # TEST represent the number of training and test samples, respectively. *: The size of the NIAH test set varies based on the context length and step size, typically around 800 samples per evaluation. |
| Hardware Specification | Yes | We evaluated the latency and throughput of ChunkKV compared to FullKV using LLaMA3-8B-Instruct on an A40 GPU. |
| Software Dependencies | No | All experiments were conducted with reuse layer is 2, batch size set to 1 and inference was performed using FlashAttention 2, each experiment was repeated 10 times and the average latency and throughput were reported. Our PyTorch implementation further extends these optimizations by leveraging GPU acceleration for all vector and matrix operations. |
| Experiment Setup | Yes | In this section, we conduct experiments to evaluate the effectiveness of ChunkKV on KV cache compression in two benchmark fields, with a chunk size set to 10 even for various model architectures. All experiments were carried out three times, using the mean score to ensure robustness. For these experiments, we use the same configuration as our main LongBench experiments in Section 4.2, with index reuse applied to consecutive layers (reuse layers = 2). We evaluated the latency and throughput of ChunkKV compared to FullKV using LLaMA3-8B-Instruct on an A40 GPU. All experiments were conducted with reuse layer is 2, batch size set to 1 and inference was performed using FlashAttention 2, each experiment was repeated 10 times and the average latency and throughput were reported. |
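
The two algorithms quoted in the Pseudocode row can be illustrated with a small pure-Python sketch. This is not the authors' implementation: the function names `chunkkv_select` and `layerwise_indices` are hypothetical, attention scores are taken as raw dot products, and chunks are scored over the tokens before the observe window only (the quoted pseudocode scores all keys and then truncates), so treat it as a reading aid under those assumptions.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def chunkkv_select(q, k, window, chunk_size, max_len):
    """Return the token indices kept after chunk-level KV compression.

    q, k: lists of vectors, one per token. Assumes 1 <= window <= len(q)
    and window < max_len. Scores are raw dot products (an assumption).
    """
    t_k = len(k)
    if t_k <= max_len:
        return list(range(t_k))  # cache already fits; keep everything

    prefix_len = t_k - window  # tokens eligible for compression
    budget = max_len - window  # slots left after reserving the window

    # Scores of the observe-window queries against every prefix key.
    attn = [[dot(qr, k[j]) for j in range(prefix_len)] for qr in q[-window:]]

    # Sum the scores inside each chunk of `chunk_size` consecutive tokens.
    n_chunks = math.ceil(prefix_len / chunk_size)
    scores = []
    for i in range(n_chunks):
        lo, hi = i * chunk_size, min((i + 1) * chunk_size, prefix_len)
        scores.append(sum(sum(row[lo:hi]) for row in attn))

    # Keep the highest-scoring chunks that fit in the budget.
    n_keep = budget // chunk_size
    top = sorted(range(n_chunks), key=lambda i: scores[i], reverse=True)[:n_keep]

    kept = []
    for i in sorted(top):  # restore positional order of the kept chunks
        kept.extend(range(i * chunk_size, min((i + 1) * chunk_size, prefix_len)))

    # The observe window itself is always appended.
    return kept[:budget] + list(range(prefix_len, t_k))


def layerwise_indices(qs, ks, n_reuse, window, chunk_size, max_len):
    """Compute indices every `n_reuse` layers and reuse them in between."""
    reuse = {}
    per_layer = []
    for layer in range(len(ks)):
        if layer % n_reuse == 0:
            reuse[layer] = chunkkv_select(
                qs[layer], ks[layer], window, chunk_size, max_len
            )
        per_layer.append(reuse[(layer // n_reuse) * n_reuse])
    return per_layer
```

With `n_reuse = 2`, as in the reported experiment setup, odd-numbered layers skip the scoring step and reuse the indices of the preceding even layer. Index reuse of this kind trades some per-layer selection accuracy for less selection work.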