Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Compute or Load KV Cache? Why Not Both?
Authors: Shuowei Jin, Xueshen Liu, Qingzhao Zhang, Zhuoqing Mao
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive evaluations across various hardware configurations, datasets, and storage conditions, Cake achieves on average 2.6× reduction in Time to First Token (TTFT) compared to compute-only and I/O-only methods. |
| Researcher Affiliation | Academia | 1University of Michigan. Correspondence to: Shuowei Jin <EMAIL>, Xueshen Liu <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Cake Bidirectional KV cache Loading Algorithm |
| Open Source Code | No | We implement Cake by extending LMCache (LMCache, 2024) and integrating it with vLLM (Kwon et al., 2023), adding approximately 1,000 lines of code. The paper mentions extending a third-party tool (LMCache) but does not provide specific access to the code for Cake itself. |
| Open Datasets | Yes | We evaluate Cake across various context lengths using three datasets with different task types: LongChat (Li et al., 2023) for multi-turn conversations, and TriviaQA and NarrativeQA (Bai et al., 2023) for long-document question-answering tasks. |
| Dataset Splits | No | Since specific token values do not impact Cake's performance evaluation (only token length matters), we generate synthetic prompts by uniformly sampling token lengths every 2k tokens within this range. The paper mentions datasets and synthetic prompt generation, but does not specify how the data was split into training, validation, or test sets for the named datasets. |
| Hardware Specification | Yes | We run our evaluation on two server configurations: 1) A server equipped with two NVIDIA A100 80GB GPUs connected via NVLink, a 64-core AMD EPYC 7763 CPU, and 2.0TB of memory. 2) A server with a single NVIDIA H100 GPU, a 26-core vCPU, and 200GB of memory. |
| Software Dependencies | Yes | We use vLLM (v0.6.2) in chunk prefill mode with token budget sizes of 512 by default. |
| Experiment Setup | Yes | Models. We evaluate Cake on various long-context models with different architectures and sizes, including LongAlpaca-7B and LongAlpaca-13B (Chen et al., 2023), which are based on LLaMA 2, as well as LLaMA 3.1-8B and LLaMA 3.1-70B. Due to hardware constraints, we use the FP8-weight version for LLaMA 3.1-70B, while all other models use FP16 weights. The first two are multi-head attention (MHA) models, while the last two apply group query attention (GQA), which introduces different computation vs. memory trade-offs. We use BF16 as the default KV cache data type. |
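The core idea behind the paper's bidirectional loading (compute KV-cache chunks from the front of the prompt while concurrently streaming cached chunks from the back, meeting in the middle) can be illustrated with a toy scheduler. The sketch below is our own simplification, not the paper's Algorithm 1: the function `bidirectional_ttft` and its constant per-chunk costs are hypothetical, and the real system overlaps GPU prefill with asynchronous I/O rather than stepping a serial simulation.

```python
def bidirectional_ttft(n_chunks: int, compute_t: float, load_t: float):
    """Toy model of bidirectional KV-cache preparation.

    A compute worker consumes chunks front-to-back (each taking compute_t),
    while an I/O worker loads chunks back-to-front (each taking load_t).
    Both run in parallel; whichever worker would finish its next chunk
    sooner takes it. Returns (makespan, number of chunks computed).
    """
    front, back = 0, n_chunks - 1
    t_compute = t_load = 0.0
    while front <= back:
        # Assign the next chunk to the worker with the earlier finish time.
        if t_compute + compute_t <= t_load + load_t:
            t_compute += compute_t
            front += 1          # computed from the front
        else:
            t_load += load_t
            back -= 1           # loaded from the back
    # All chunks ready when the slower worker finishes.
    return max(t_compute, t_load), front


if __name__ == "__main__":
    # Equal costs: work splits evenly, halving the compute-only time.
    print(bidirectional_ttft(10, 1.0, 1.0))   # (5.0, 5)
    # Faster I/O: more chunks are loaded than computed.
    print(bidirectional_ttft(10, 1.0, 0.5))   # (3.5, 3)
```

With equal per-chunk costs the toy model halves the compute-only makespan, and as I/O gets faster the split shifts toward loading, which is the intuition for why a hybrid scheme can beat both compute-only and I/O-only baselines.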