Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences

Authors: Ziran Qin, Yuchen Cao, Mingbao Lin, Wen Hu, Shixuan Fan, Ke Cheng, Weiyao Lin, Jianguo Li

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Comprehensive experiments on LongBench and NeedleBench show that CAKE maintains model performance with only 3.2% of the KV cache and consistently outperforms current baselines across various models and memory constraints, particularly in low-memory settings. Additionally, CAKE achieves over 10× speedup in decoding latency compared to full cache when processing contexts of 128K tokens with FlashAttention-2.
Researcher Affiliation Collaboration 1Shanghai Jiao Tong University, 2Ant Group, 3Independent Researcher
Pseudocode Yes Algorithm 1 outlines our cache management procedure, which partitions the prefilling process into L stages, one for each model layer, and cascadingly manages cache memory guided by preference scores. ... Listing 1 provides PyTorch-style pseudo-code for CAKE's implementation with FlashAttention.
Open Source Code Yes Our code is available at https://github.com/antgroup/cakekv.
Open Datasets Yes To evaluate CAKE's performance across various memory budgets, we use two carefully designed benchmarks: (1) LongBench (Bai et al., 2023): Focuses on long-context understanding, encompassing 16 datasets in six categories: Single-Document QA, Multi-Document QA, Summarization, Few-shot Learning, Synthetic Tasks, and Code Completion. (2) NeedleBench (Li et al., 2024a): Tests retrieval and reasoning in complex contexts through three subtasks: Single-Needle Retrieval, Multi-Needle Retrieval, and Multi-Needle Reasoning.
Dataset Splits Yes To evaluate CAKE's performance across various memory budgets, we use two carefully designed benchmarks: (1) LongBench (Bai et al., 2023)... (2) NeedleBench (Li et al., 2024a)... All experiments are conducted rigorously following the original NeedleBench protocol: Levenshtein distance is used to measure the similarity between predictions and references. Each case is repeated ten times to ensure stable scores, and the results are weighted-averaged to obtain an overall score, providing a balanced representation of each task.
Hardware Specification Yes Experiments are run on NVIDIA A100 80GB GPUs.
Software Dependencies No Section C 'MORE IMPLEMENTATION DETAILS' mentions 'FlashAttention-2 (Dao, 2023)' and 'PyTorch-style pseudo-code'. However, specific version numbers for PyTorch or FlashAttention-2 are not provided, preventing a reproducible description of ancillary software.
Experiment Setup Yes For the preference-prioritized adaptive allocation strategy, the temperature parameters τ1 and τ2 are set to 0.2 and 0.4 respectively, optimized through grid search. For our experimental models, we aggregate each quantized value to represent its layer characteristics. For the attention-shift tolerant eviction indicator, we fix γ = 200. Following SnapKV (Li et al., 2024b), we use an observation window of size Sw = 32 and utilize a pooling layer to cluster information.
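The setup above describes a preference-prioritized adaptive allocation: each layer receives a share of the total KV-cache budget in proportion to a temperature-scaled softmax of its preference score, with temperatures (e.g., τ1 = 0.2) controlling how sharply budget concentrates on high-preference layers. A minimal sketch of that allocation rule is below; the function name, argument names, and rounding policy are illustrative assumptions, not CAKE's actual implementation.

```python
import math

def allocate_layer_budgets(preference_scores, total_budget, tau=0.2):
    """Split a total KV-cache budget across layers using a
    temperature-scaled softmax over per-layer preference scores.

    Hypothetical sketch of preference-prioritized allocation;
    names and rounding policy are assumptions, not CAKE's API.
    """
    # Temperature-scaled softmax: smaller tau sharpens the distribution,
    # giving proportionally more budget to high-preference layers.
    m = max(s / tau for s in preference_scores)  # subtract max for stability
    exps = [math.exp(s / tau - m) for s in preference_scores]
    z = sum(exps)
    budgets = [int(total_budget * e / z) for e in exps]
    # Assign any rounding remainder to the highest-preference layer.
    top = preference_scores.index(max(preference_scores))
    budgets[top] += total_budget - sum(budgets)
    return budgets
```

With tau closer to 1.0 the allocation approaches uniform-in-proportion; the grid-searched values reported above (0.2 and 0.4) correspond to comparatively sharp allocations.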