Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Tail-Optimized Caching for LLM Inference

Authors: Wenxin Zhang, Yueying Li, Ciamac C. Moallemi, Tianyi Peng

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimentally, on real conversation data WildChat [Zhao et al., 2024], Tail-Optimized LRU achieves up to 27.5% reduction in P90 tail Time to First Token latency and 23.9% in P95 tail latency compared to LRU, along with up to 38.9% decrease in SLO violations of 200 ms.
Researcher Affiliation | Academia | Wenxin Zhang, Columbia Business School; Yueying Li, Cornell University, Department of Computer Science; Ciamac C. Moallemi, Columbia Business School; Tianyi Peng, Columbia Business School
Pseudocode | Yes | Algorithm 1: Tail-Optimized LRU Policy. Input: Number of conversations N, Timestamp of Last Turn {τ_i}, Number of cached blocks {X_i}, conversation history lengths {L_i}, arriving conversation θ, arriving conversation length L_θ. Parameters: Policy parameters: threshold ξ, next-turn length estimate {Q̂_i}. Output: Updated cache sizes {X_i}.
Open Source Code | Yes | Justification: We perform experiments on public datasets and provide all codes in the submission. ... Justification: We would submit and release the code.
Open Datasets | Yes | Finally, we evaluate the performance of T-LRU and LRU on real multi-turn chat traces from ShareGPT [Contributors, 2025] and WildChat [Zhao et al., 2024].
Dataset Splits | No | For each dataset, we select conversations based on their arrival timestamps and extract the first 1000-2000 turns across these conversations. Specifically, we sample conversations in chronological order (by first-turn timestamp), split each conversation into individual turns, and simulate their arrivals following the observed timestamps in the trace. ... ShareGPT [Contributors, 2025] does not include timestamps of each request, thus we generate them with the stochastic model described in Section 4.
Hardware Specification | Yes | Our default experimental configuration uses Vicuna-7B served on a single A100 GPU via vLLM with tensor parallelism disabled (TP = 1) and without mixed batching. ... We used A100 GPU to estimate the default latency function.
Software Dependencies | No | Our default experimental configuration uses Vicuna-7B served on a single A100 GPU via vLLM with tensor parallelism disabled (TP = 1) and without mixed batching.
Experiment Setup | Yes | Our default experimental configuration uses Vicuna-7B served on a single A100 GPU via vLLM with tensor parallelism disabled (TP = 1) and without mixed batching. We fix the input parameter for T-LRU, next-prompt length Q̂, to be the average prompt length (200 for WildChat, 150 for ShareGPT), and use 1024 as the threshold for Threshold-LRU following the one used by OpenAI. ... We set ξ = ξ_s/c using a simple linear fit between awaited prefill tokens and prefill time, which is easy to implement in practice.
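
The Experiment Setup response mentions setting the threshold ξ from a simple linear fit between awaited prefill tokens and prefill time. The following is a minimal sketch of how such a conversion might look, not the authors' code. It assumes that ξ_s is a latency target in seconds and that c is the fitted per-token prefill time, which is one reading of the quoted sentence. The function names and the sample measurements are hypothetical; only the 200 ms target is taken from the Research Type row.

```python
from typing import Sequence, Tuple


def fit_prefill_line(tokens: Sequence[float],
                     seconds: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit: seconds ~ slope * tokens + intercept."""
    n = len(tokens)
    if n != len(seconds) or n < 2:
        raise ValueError("need at least two paired samples")
    mean_x = sum(tokens) / n
    mean_y = sum(seconds) / n
    sxx = sum((x - mean_x) ** 2 for x in tokens)
    if sxx == 0:
        raise ValueError("token counts must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(tokens, seconds))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def token_threshold(slo_seconds: float, slope: float) -> float:
    """Convert a latency budget into a token budget: xi = xi_s / c.

    Assumption: xi_s is the latency target and c is the fitted slope
    (seconds of prefill per uncached token).
    """
    if slope <= 0:
        raise ValueError("slope must be positive")
    return slo_seconds / slope


# Hypothetical measurements (not from the paper):
# uncached prefill tokens versus observed prefill time in seconds.
samples_tokens = [256, 512, 1024, 2048]
samples_seconds = [0.025, 0.050, 0.100, 0.200]

c, b = fit_prefill_line(samples_tokens, samples_seconds)
xi = token_threshold(0.200, c)  # 200 ms target, as quoted in the Research Type row
print(f"c = {c:.6f} s/token, intercept = {b:.6f} s, xi = {xi:.0f} tokens")
```

The sketch ignores the fitted intercept when converting the latency target into tokens, matching the ξ = ξ_s/c form quoted above. A deployment with a large fixed prefill overhead might instead subtract the intercept from the target first; that variation is not described in the source.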