Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Learned Prefix Caching for Efficient LLM Inference

Authors: Dongsheng Yang, Austin Li, Kai Li, Wyatt Lloyd

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive evaluations across three real-world datasets demonstrate that LPC achieves 18–47% reductions in required cache sizes for equivalent hit ratios and has an 11% improvement in LLM prefilling throughput in an emulated environment.
Researcher Affiliation	Academia	Dongsheng Yang, Austin Li, Kai Li, Wyatt Lloyd. Department of Computer Science, Princeton University. EMAIL
Pseudocode	No	Section 3.4 describes the eviction algorithm operations (Insertion, Promotion, Probability update, Eviction) in prose, detailing the steps, but does not present them in a structured pseudocode block or algorithm figure.
Open Source Code	Yes	Our contributions are as follows: (3) Our evaluations of an implemented prototype system shows that LPC consistently outperforms LRU, achieving 18–47% reductions in the required cache size to achieve the same hit ratio. In an emulated disaggregated serving [Zhong et al., 2024] and reasoning [Snell et al., 2024] environment, this translates to LPC delivering up to 11% higher prefilling throughput and reducing the first token latency of up to 7% of requests by 42–75%. Furthermore, as a system-level cache management policy, LPC is orthogonal to and can be combined with other optimizations like Flash Attention or speculative decoding. The implementation can be found at https://github.com/yangdsh/LPC.
Open Datasets	Yes	We run prefix cache hit ratio comparisons on real-world conversational datasets. These include: LMSys: a large-scale collection of user-chatbot dialogues [Zheng et al., 2023a]. ShareGPT: conversations with ChatGPT [ShareGPT Team, 2023]. Chatbot-Arena: a platform where users compare various LLMs [Chiang et al., 2024].
Dataset Splits	Yes	Training and validation data are collected from a dedicated partition of conversations from the target dataset (as described in Section 4.2). This partition is strictly isolated and excluded from any datasets used for online evaluation. The size of the partition is half of the dataset.
Hardware Specification	Yes	All experiments were conducted on NVIDIA H100 GPUs, equipped with 80 GB of HBM3 memory. We use the Qwen3-32B-FP8 model [Qwen3 Team, 2025] to run inference. It is a 32-billion parameter reasoning model and is licensed with Apache 2.0. We use a single GPU to run the model. Considering the memory used by the model and PyTorch, the memory available for KV cache and prefix cache is 40 GB in total. The other hardware usage includes 8 CPUs and 64 GB CPU memory.
Software Dependencies	Yes	We implement a prototype of LPC on top of the vLLM serving framework [Kwon et al., 2023], which is licensed under Apache-2.0. The implementation is based on the main branch of vLLM as of March 10, 2025.
Experiment Setup	Yes	Training configurations. The training works as follows. First, during training, only the weights of the MLP classifier are updated; the pre-trained text embedding model is kept frozen. This partial fine-tuning method allows the MLP to adapt specifically to the continuation prediction task while ensuring the framework remains lightweight. Full fine-tuning would be operationally expensive for frequent retraining, whereas training only the tiny MLP head is extremely fast (typically less than 10 minutes), making daily adaptation feasible. Second, text embeddings for the training dataset are precomputed and cached after the first epoch to significantly accelerate training. Subsequent epochs directly reuse these cached embeddings because the text embedding model is frozen, bypassing repeated computations and saving 90% of the training time. A binary cross-entropy loss function is employed. The loss is weighted with more weights for the minority class to handle class imbalance. The optimizer is Adam with a learning rate of 5 × 10⁻⁴. The training runs until convergence (within 20 epochs in our evaluation), and the checkpoint with the lowest loss on the validation dataset is saved for running online inference.
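
Illustrative sketch (not taken from the LPC repository): the training setup quoted in the Experiment Setup row combines three ideas that are easy to show in isolation: a class-weighted binary cross-entropy loss, a cache of embeddings from a frozen model, and selection of the checkpoint with the lowest validation loss. The Python below is a minimal sketch under stated assumptions. The inverse-frequency weighting scheme and the names `inverse_frequency_weights`, `weighted_bce`, `EmbeddingCache`, and `best_epoch` are hypothetical; the quoted text says only that the minority class receives more weight and does not give the weights. The MLP itself and the Adam optimizer are out of scope here.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Assumed scheme: weight = n / (2 * class_count), so the rarer class weighs more."""
    counts = Counter(labels)
    n = len(labels)
    return {cls: n / (2 * count) for cls, count in counts.items()}


def weighted_bce(probs, labels, weights, eps=1e-12):
    """Mean binary cross-entropy with a per-class weight applied to each example."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        loss = -math.log(p) if y == 1 else -math.log(1.0 - p)
        total += weights[y] * loss
    return total / len(labels)


class EmbeddingCache:
    """Memoizes a frozen embedding function so later epochs skip recomputation."""

    def __init__(self, embed_fn):
        # embed_fn is a hypothetical stand-in for the frozen text embedding model.
        self.embed_fn = embed_fn
        self._store = {}

    def get(self, text):
        if text not in self._store:
            self._store[text] = self.embed_fn(text)
        return self._store[text]


def best_epoch(val_losses):
    """Index of the epoch with the lowest validation loss (the checkpoint to keep)."""
    return min(range(len(val_losses)), key=val_losses.__getitem__)
```

Caching is safe in this setting only because the embedding model is frozen: the same text always maps to the same vector, so every epoch after the first can read from the cache instead of recomputing.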