Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

SmartCache: Context-aware Semantic Cache for Efficient Multi-turn LLM Inference

Authors: Chengye Yu, Tianyu Wang, Zili Shao, Song Jiang

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type Experimental The evaluation demonstrates SmartCache's effectiveness across multiple benchmarks. On the CoQA and SQuAD datasets, SmartCache reduces KV cache memory usage by up to 59.1% compared to prefix caching and 56.0% over semantic caching, while cutting Time-to-First-Token (TTFT) by 78.0% and 71.7%, respectively. It improves answer quality metrics, achieving 39.9% higher F1 and 39.1% higher ROUGE-L for Qwen-2.5-1.5B on CoQA. The Semantic-aware Tiered Eviction Policy (STEP) outperforms LRU/LFU by 29.9% in reuse distance under skewed workloads.
Researcher Affiliation Academia Chengye Yu, Chinese University of Hong Kong, Hong Kong, China, EMAIL; Tianyu Wang, Shenzhen University, Shenzhen, China, EMAIL; Zili Shao, Chinese University of Hong Kong, Hong Kong, China, EMAIL; Song Jiang, University of Texas at Arlington, Arlington, USA, EMAIL
Pseudocode Yes Algorithm 1: Cache Operations. Require: session s, new query q_new, Semantic Forest F (access to I_global and local I_N), similarity threshold τ_sim, embedding model E(·). Ensure: Boolean is_hit, node N_hit, response a_cached. 1: function CACHELOOKUP(s, q_new, F, τ_sim)
Open Source Code No The paper mentions using PyTorch 2.3, BGE-M3, and faiss, but does not provide a statement or link for the open-source code of their own methodology (SmartCache).
Open Datasets Yes Dataset. Three datasets are used in the evaluation, including the CoQA dev dataset [33] and SQuAD2.0 dev dataset [31]. CoQA is a conversational question answering dataset where questions are asked sequentially on the same short story. Later questions often depend on conversation history. SQuAD (Stanford Question Answering Dataset) contains questions on Wikipedia pages. Each story or paragraph used in the experiments has one original conversation session and two similar conversation sessions, with each session consisting of on average 5 turns of progressive questions and answers.
Dataset Splits Yes Dataset. Three datasets are used in the evaluation, including the CoQA dev dataset [33] and SQuAD2.0 dev dataset [31].
Hardware Specification Yes Hardware settings. We use a server equipped with Intel Xeon Silver 4310 CPU [17] and NVIDIA RTX 4090 GPU [26], connected through PCIe 4.0 x16. The host has 256GB DDR4 memory and GPU has 24GB GDDR6X memory. Our server runs Linux 5.4 with CUDA 12.0 [5]. We implement our method using PyTorch 2.3 [29]. We evaluate our method on three different-sized open-source LLMs: Qwen-2.5-1.5B-Instruct [43], Llama-3.1-8B-Instruct [16], and Mistral-7B-Instruct-v0.2 [3].
Software Dependencies Yes Our server runs Linux 5.4 with CUDA 12.0 [5]. We implement our method using PyTorch 2.3 [29]. ... BGE-M3 [7] is used as the embedding model with the dimension of 1024. Embedding vectors are stored and searched using faiss [11] vector index based on L2 distance.
Experiment Setup Yes The KV cache block size is 16 tokens. BGE-M3 [7] is used as the embedding model with the dimension of 1024. Embedding vectors are stored and searched using faiss [11] vector index based on L2 distance. The similarity threshold is set to 0.75. ... We adopt α = 0.6 and β = 0.4 for STEP in all evaluations, as this setting attains the highest average reuse distance.
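
The setup above pairs an L2-distance vector index with a similarity threshold of 0.75, but the quoted text does not say how the two relate. The following is a minimal sketch of one possible reading, not the authors' implementation: it assumes embeddings are unit-normalized and that the threshold is a cosine similarity, in which case the identity ||a - b||^2 = 2 - 2cos(a, b) converts a squared L2 distance into a similarity. The function names (`normalize`, `squared_l2`, `similarity_from_squared_l2`, `cache_lookup`) and the flat list standing in for the faiss index are hypothetical and chosen for illustration only.

```python
import math

# Threshold value reported in the experiment setup. Treating it as a
# cosine similarity is an assumption; the source does not specify this.
SIM_THRESHOLD = 0.75


def normalize(vec):
    """Scale a vector to unit length (assumed preprocessing step)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def similarity_from_squared_l2(d2):
    """For unit vectors, ||a - b||^2 = 2 - 2*cos, so cos = 1 - d2 / 2."""
    return 1.0 - d2 / 2.0


def cache_lookup(query_vec, cached, threshold=SIM_THRESHOLD):
    """Brute-force nearest-neighbour lookup over (embedding, response) pairs.

    Returns (is_hit, response). A linear scan stands in for the vector
    index here so the sketch has no third-party dependencies.
    """
    q = normalize(query_vec)
    best_sim, best_resp = -1.0, None
    for emb, resp in cached:
        sim = similarity_from_squared_l2(squared_l2(q, normalize(emb)))
        if sim > best_sim:
            best_sim, best_resp = sim, resp
    if best_resp is not None and best_sim >= threshold:
        return True, best_resp
    return False, None


if __name__ == "__main__":
    cached = [([1.0, 0.0], "A"), ([0.0, 1.0], "B")]
    print(cache_lookup([0.9, 0.1], cached))  # close to the first entry
    print(cache_lookup([1.0, 1.0], cached))  # equidistant from both entries
```

Under these assumptions, a similarity threshold of 0.75 corresponds to a squared L2 distance of at most 0.5 between unit vectors, so the same cutoff could be applied directly to the distances an L2 index returns.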