Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Learned Prefix Caching for Efficient LLM Inference

Authors: Dongsheng Yang, Austin Li, Kai Li, Wyatt Lloyd

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluations across three real-world datasets demonstrate that LPC achieves 18–47% reductions in required cache sizes for equivalent hit ratios and has an 11% improvement in LLM prefilling throughput in an emulated environment.
Researcher Affiliation | Academia | Dongsheng Yang, Austin Li, Kai Li, Wyatt Lloyd; Department of Computer Science, Princeton University
Pseudocode | No | Section 3.4 describes the eviction algorithm operations (Insertion, Promotion, Probability update, Eviction) in prose, detailing the steps, but does not present them in a structured pseudocode block or algorithm figure.
Open Source Code | Yes | Our contributions are as follows: (3) Our evaluations of an implemented prototype system show that LPC consistently outperforms LRU, achieving 18–47% reductions in the required cache size to achieve the same hit ratio. In an emulated disaggregated serving [Zhong et al., 2024] and reasoning [Snell et al., 2024] environment, this translates to LPC delivering up to 11% higher prefilling throughput and reducing the first token latency of up to 7% of requests by 42–75%. Furthermore, as a system-level cache management policy, LPC is orthogonal to and can be combined with other optimizations like Flash Attention or speculative decoding. The implementation can be found at https://github.com/yangdsh/LPC.
Open Datasets | Yes | We run prefix cache hit ratio comparisons on real-world conversational datasets. These include: LMSys: a large-scale collection of user-chatbot dialogues [Zheng et al., 2023a]. ShareGPT: conversations with ChatGPT [ShareGPT Team, 2023]. Chatbot-Arena: a platform where users compare various LLMs [Chiang et al., 2024].
Dataset Splits | Yes | Training and validation data are collected from a dedicated partition of conversations from the target dataset (as described in Section 4.2). This partition is strictly isolated and excluded from any datasets used for online evaluation. The size of the partition is half of the dataset.
Hardware Specification | Yes | All experiments were conducted on NVIDIA H100 GPUs, equipped with 80 GB of HBM3 memory. We use the Qwen3-32B-FP8 model [Qwen3 Team, 2025] to run inference. It is a 32-billion parameter reasoning model and is licensed with Apache 2.0. We use a single GPU to run the model. Considering the memory used by the model and PyTorch, the memory available for KV cache and prefix cache is 40 GB in total. The other hardware usage includes 8 CPUs and 64 GB CPU memory.
Software Dependencies | Yes | We implement a prototype of LPC on top of the vLLM serving framework [Kwon et al., 2023], which is licensed under Apache-2.0. The implementation is based on the main branch of vLLM as of March 10, 2025.
Experiment Setup | Yes | Training configurations. The training works as follows. First, during training, only the weights of the MLP classifier are updated; the pre-trained text embedding model is kept frozen. This partial fine-tuning method allows the MLP to adapt specifically to the continuation prediction task while ensuring the framework remains lightweight. Full fine-tuning would be operationally expensive for frequent retraining, whereas training only the tiny MLP head is extremely fast (typically less than 10 minutes), making daily adaptation feasible. Second, text embeddings for the training dataset are precomputed and cached after the first epoch to significantly accelerate training. Subsequent epochs directly reuse these cached embeddings because the text embedding model is frozen, bypassing repeated computations and saving 90% of the training time. A binary cross-entropy loss function is employed. The loss is weighted with more weights for the minority class to handle class imbalance. The optimizer is Adam with a learning rate of 5 × 10⁻⁴. The training runs until convergence (within 20 epochs in our evaluation), and the checkpoint with the lowest loss on the validation dataset is saved for running online inference.
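
As an illustration of the loss weighting and checkpoint selection quoted in the Experiment Setup row, here is a minimal pure-Python sketch. It is not the authors' code: the function names, the negatives-to-positives weighting rule, and the assumption that the positive label is the minority class are all hypothetical, since the quoted text only says that the minority class receives more weight.

```python
import math


def minority_weight(labels):
    """Hypothetical weighting rule: ratio of negative to positive examples.

    Assumes label 1 is the minority class; the source does not state the rule.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0:
        raise ValueError("need at least one positive example")
    return neg / pos


def weighted_bce(probs, labels, pos_weight):
    """Mean binary cross-entropy with the positive-class term scaled by pos_weight."""
    eps = 1e-12  # clamp probabilities so log() never receives 0
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def best_checkpoint(val_losses):
    """Return the index of the epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__)
```

With labels [1, 0, 0, 0], the hypothetical rule gives a positive-class weight of 3.0, so a single positive example contributes as much to the loss as the three negatives combined. PyTorch offers the same form of weighting through the pos_weight argument of torch.nn.BCEWithLogitsLoss, though the quoted text does not say which API the authors used.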