Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Compute or Load KV Cache? Why Not Both?
Authors: Shuowei Jin, Xueshen Liu, Qingzhao Zhang, Zhuoqing Mao
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive evaluations across various hardware configurations, datasets, and storage conditions, Cake achieves on average 2.6× reduction in Time to First Token (TTFT) compared to compute-only and I/O-only methods. |
| Researcher Affiliation | Academia | 1University of Michigan. Correspondence to: Shuowei Jin <EMAIL>, Xueshen Liu <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Cake Bidirectional KV cache Loading Algorithm |
| Open Source Code | No | We implement Cake by extending LMCache (LMCache, 2024) and integrating it with vLLM (Kwon et al., 2023), adding approximately 1,000 lines of code. The paper mentions extending a third-party tool (LMCache) but does not provide specific access to the code for Cake itself. |
| Open Datasets | Yes | We evaluate Cake across various context lengths using three datasets with different task types: LongChat (Li et al., 2023) for multi-turn conversations, and TriviaQA and NarrativeQA (Bai et al., 2023) for long-document question-answering tasks. |
| Dataset Splits | No | Since specific token values do not impact Cake's performance evaluation (only token length matters), we generate synthetic prompts by uniformly sampling token lengths every 2k tokens within this range. The paper mentions datasets and synthetic prompt generation, but does not specify how the data was split into training, validation, or test sets for the named datasets. |
| Hardware Specification | Yes | We run our evaluation on two server configurations: 1) A server equipped with two NVIDIA A100 80GB GPUs connected via NVLink, a 64-core AMD EPYC 7763 CPU, and 2.0TB of memory. 2) A server with a single NVIDIA H100 GPU, a 26-core vCPU, and 200GB of memory. |
| Software Dependencies | Yes | We use vLLM (v0.6.2) in chunk prefill mode with token budget sizes of 512 by default. |
| Experiment Setup | Yes | Models. We evaluate Cake on various long-context models with different architectures and sizes, including LongAlpaca-7B and LongAlpaca-13B (Chen et al., 2023), which are based on LLaMA 2, as well as LLaMA 3.1-8B and LLaMA 3.1-70B. Due to hardware constraints, we use the FP8-weight version for LLaMA 3.1-70B, while all other models use FP16 weights. The first two are multi-head attention (MHA) models, while the last two apply group query attention (GQA), which introduces different computation vs. memory trade-offs. We use BF16 as the default KV cache data type. |
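The core idea behind the paper's bidirectional loading (compute KV-cache chunks from the front of the prompt while concurrently streaming cached chunks from the back, meeting in the middle) can be illustrated with a toy scheduler. The sketch below is our own simplification, not the paper's Algorithm 1: the function `bidirectional_ttft` and its constant per-chunk costs are hypothetical, and the real system overlaps GPU prefill with asynchronous I/O rather than stepping a serial simulation.

```python
def bidirectional_ttft(n_chunks: int, compute_t: float, load_t: float):
    """Toy model of bidirectional KV-cache preparation.

    A compute worker consumes chunks front-to-back (each taking compute_t),
    while an I/O worker loads chunks back-to-front (each taking load_t).
    Both run in parallel; whichever worker would finish its next chunk
    sooner takes it. Returns (makespan, number of chunks computed).
    """
    front, back = 0, n_chunks - 1
    t_compute = t_load = 0.0
    while front <= back:
        # Assign the next chunk to the worker with the earlier finish time.
        if t_compute + compute_t <= t_load + load_t:
            t_compute += compute_t
            front += 1          # computed from the front
        else:
            t_load += load_t
            back -= 1           # loaded from the back
    # All chunks ready when the slower worker finishes.
    return max(t_compute, t_load), front


if __name__ == "__main__":
    # Equal costs: work splits evenly, halving the compute-only time.
    print(bidirectional_ttft(10, 1.0, 1.0))   # (5.0, 5)
    # Faster I/O: more chunks are loaded than computed.
    print(bidirectional_ttft(10, 1.0, 0.5))   # (3.5, 3)
```

With equal per-chunk costs the toy model halves the compute-only makespan, and as I/O gets faster the split shifts toward loading, which is the intuition for why a hybrid scheme can beat both compute-only and I/O-only baselines.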