Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

AttentionPredictor: Temporal Patterns Matter for KV Cache Compression

Authors: Qingyue Yang, Jie Wang, Xing Li, Zhihai Wang, Chen Chen, Lei Chen, Xianzhi Yu, Wulong Liu, Jianye Hao, Mingxuan Yuan, Bin Li

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that our approach achieves 13× KV cache compression and 5.6× speedup in a cache offloading scenario with comparable LLM performance, significantly outperforming the state-of-the-arts. The code is available at https://github.com/MIRALab-USTC/LLM-AttentionPredictor. ... We evaluate our method on various long-context tasks in the LongBench benchmark, with the KV cache budgets ranging from 1024 to 4096. ... As shown in Table 1, AttentionPredictor surpasses the performance of all SOTA KV cache eviction and retrieval methods across various KV budgets and LLMs. ... As shown in Figure 4, AttentionPredictor consistently outperforms both baselines. ... We evaluate the prediction accuracy to evaluate our predictor, and the metric is $R_{\text{rec}}^{\text{pred}} / R_{\text{rec}}^{\text{target}}$. On the three representative tasks QA, summary, and mathematical reasoning with different KV cache budgets, AttentionPredictor consistently achieves a higher average recovery rate compared to H2O and Quest, as shown in Table 3.
Researcher Affiliation | Collaboration | (1) MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China; (2) Huawei Noah's Ark Lab; (3) College of Intelligence and Computing, Tianjin University. EMAIL, EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1: Identify Critical Tokens. Input: attention scores $A_t$, attention history series $A_H$, block size $b$, KV budget $B$. Output: critical KV token indices set $S$.
Open Source Code | Yes | The code is available at https://github.com/MIRALab-USTC/LLM-AttentionPredictor.
Open Datasets | Yes | We use the LongBench [45], InfiniteBench [46] and RULER QA [47] dataset for long context evaluation. ... We employ the AIME [48] dataset... We use GSM8K math dataset... We use MMLU [49] and GPQA [50] dataset... We use the Needle In A Haystack [51] experiment... Benchmarks: LongBench [45]: License, AIME 2024 [48]: License, GSM8K [60]: License, Needle In A Haystack [51]: License, InfiniteBench [46]: License.
Dataset Splits | Yes | To simulate CoT tasks within long-context, we increased the number of few-shot examples. Specifically, we randomly selected a fixed number of questions and standard CoT answer pairs as prompts, along with the questions to be tested. We chose 25, 47, and 97 few-shot examples, resulting in input lengths of approximately 4K, 8K, and 16K tokens respectively. ... The test set is a 20% split from the datasets.
Hardware Specification | Yes | We conducted experiments on NVIDIA A800 (80GB) GPUs.
Software Dependencies | No | The paper mentions using "FlashAttention2" and "Hugging Face library" without specifying their version numbers. No other software dependencies are listed with specific versions.
Experiment Setup | Yes | We set the history step H to 64, the block size b to 16, and the calibration step M to 5. Performance analysis of these hyperparameters is discussed in Section 4.6. Follow Quest [31], we did not apply our method or any other algorithms to the first two layers of the LLM. Following the settings of H2O and StreamingLLM, We allocated the budget equally to the prefix and local tokens, assigning 64 tokens each.
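
The Pseudocode and Experiment Setup entries above quote only the signature of Algorithm 1 (attention scores, attention history, block size b, KV budget B in; critical token index set S out) and the reported settings (b = 16, 64 prefix and 64 local tokens). To make those quantities concrete, the following is a minimal NumPy sketch of block-wise token selection under a budget. It is not the paper's algorithm: the function name `select_critical_tokens`, the max-pooling of scores per block, and the greedy top-block choice are assumptions made for illustration, the attention history input is omitted, and the scores are treated as if they had already been predicted.

```python
import numpy as np


def select_critical_tokens(scores, block_size=16, budget=1024,
                           n_prefix=64, n_local=64):
    """Hypothetical sketch: pick token indices to keep under a KV budget.

    scores: 1-D array of (assumed already predicted) attention scores,
    one per cached token. Returns the kept indices in ascending order.
    """
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    if budget < n_prefix + n_local:
        raise ValueError("budget must cover the prefix and local tokens")
    if n <= budget:
        return np.arange(n)  # everything fits, nothing to evict

    # Always keep the first n_prefix and last n_local tokens
    # (64 each in the reported setup).
    keep = np.zeros(n, dtype=bool)
    keep[:n_prefix] = True
    if n_local > 0:
        keep[n - n_local:] = True
    remaining = budget - int(keep.sum())

    # Assumption: score each block by the maximum score of its tokens,
    # ignoring tokens that are already kept.
    n_blocks = -(-n // block_size)  # ceiling division
    padded = np.full(n_blocks * block_size, -np.inf)
    padded[:n] = np.where(keep, -np.inf, scores)
    block_scores = padded.reshape(n_blocks, block_size).max(axis=1)

    # Greedily take the highest-scoring blocks until the budget is spent.
    for blk in np.argsort(-block_scores, kind="stable"):
        if remaining <= 0 or np.isneginf(block_scores[blk]):
            break
        start = int(blk) * block_size
        idx = np.arange(start, min(start + block_size, n))
        idx = idx[~keep[idx]][:remaining]
        keep[idx] = True
        remaining -= idx.size

    return np.flatnonzero(keep)


if __name__ == "__main__":
    demo = np.zeros(256)
    demo[100] = 10.0  # falls in block 6 (tokens 96..111)
    demo[130] = 9.0   # falls in block 8 (tokens 128..143)
    kept = select_critical_tokens(demo, block_size=16, budget=160)
    print(kept.size)  # 160
```

Selecting whole blocks rather than single tokens keeps the retained indices contiguous in runs of b, which is one plausible reason to carry a block size parameter at all; the actual criterion used in the paper should be taken from its Algorithm 1 and released code.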