Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences
Authors: Ziran Qin, Yuchen Cao, Mingbao Lin, Wen Hu, Shixuan Fan, Ke Cheng, Weiyao Lin, Jianguo Li
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments on LongBench and NeedleBench show that CAKE maintains model performance with only 3.2% of the KV cache and consistently outperforms current baselines across various models and memory constraints, particularly in low-memory settings. Additionally, CAKE achieves over 10× speedup in decoding latency compared to full cache when processing contexts of 128K tokens with FlashAttention-2. |
| Researcher Affiliation | Collaboration | 1Shanghai Jiao Tong University, 2Ant Group, 3Independent Researcher |
| Pseudocode | Yes | Algorithm 1 outlines our cache management procedure, which partitions the prefilling process into L stages, one for each model layer, and cascadingly manages cache memory guided by preference scores. ... Listing 1 provides PyTorch-style pseudo-code for CAKE's implementation with FlashAttention. |
| Open Source Code | Yes | Our code is available at https://github.com/antgroup/cakekv. |
| Open Datasets | Yes | To evaluate CAKE's performance across various memory budgets, we use two carefully designed benchmarks: (1) LongBench (Bai et al., 2023): Focuses on long-context understanding, encompassing 16 datasets in six categories: Single/Multi-Document QA, Summarization, Few-shot Learning, Synthetic Tasks, and Code Completion. (2) NeedleBench (Li et al., 2024a): Tests retrieval and reasoning in complex contexts through three subtasks: Single-Needle Retrieval, Multi-Needle Retrieval, and Multi-Needle Reasoning. |
| Dataset Splits | Yes | To evaluate CAKE's performance across various memory budgets, we use two carefully designed benchmarks: (1) LongBench (Bai et al., 2023)... (2) NeedleBench (Li et al., 2024a)... All experiments are conducted rigorously following the original NeedleBench protocol: Levenshtein distance is used to measure the similarity between predictions and references. Each case is repeated ten times to ensure stable scores, and the results are weighted-averaged to obtain an overall score, providing a balanced representation of each task. |
| Hardware Specification | Yes | Experiments are run on NVIDIA A100 80GB GPUs. |
| Software Dependencies | No | Section C 'MORE IMPLEMENTATION DETAILS' mentions 'FlashAttention-2 (Dao, 2023)' and 'PyTorch-style pseudo-code'. However, specific version numbers for PyTorch or FlashAttention-2 are not provided, preventing a reproducible description of ancillary software. |
| Experiment Setup | Yes | For the preference-prioritized adaptive allocation strategy, the temperature parameters τ1 and τ2 are set to 0.2 and 0.4 respectively, optimized through grid search. For our experimental models, we aggregate each quantized value to represent its layer characteristics. For the attention-shift tolerant eviction indicator, we fix γ = 200. Following SnapKV (Li et al., 2024b), we use an observation window of size Sw = 32 and utilize a pooling layer to cluster information. |
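The experiment-setup row describes temperature-scaled preference scores (τ1, τ2) that drive a layer-wise KV-cache budget allocation. The sketch below illustrates that general idea only; the function names, the multiplicative combination of an attention-dispersion term and an attention-shift term, and the proportional budget split are assumptions for illustration, not the paper's exact formulas.

```python
def layer_preferences(dispersions, shifts, tau1=0.2, tau2=0.4):
    """Hypothetical per-layer preference score: combine an attention-
    dispersion statistic (sharpened by temperature tau1) with an
    attention-shift statistic (sharpened by tau2). The temperatures
    match the values reported in the setup (0.2 and 0.4)."""
    return [d ** (1.0 / tau1) * s ** (1.0 / tau2)
            for d, s in zip(dispersions, shifts)]


def allocate_budget(scores, total_budget):
    """Split a global KV-cache budget across layers in proportion to
    their preference scores (illustrative allocation rule)."""
    total = sum(scores)
    return [round(total_budget * p / total) for p in scores]


# Layers with higher preference receive proportionally more cache slots.
budgets = allocate_budget([1.0, 1.0, 2.0], total_budget=400)
```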