Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..
LLM Query Scheduling with Prefix Reuse and Latency Constraints
Authors: Gregory Dexter, Shao Tang, Ata Fatahi, Qingquan Song, Tejas Dharamsi, Aman Gupta
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations in a realistic serving setting validates our findings, showing significant reductions in P99 TTFT compared to baseline methods. In this section, we measure the performance of k-LPM versus FCFS and LPM in a realistic setting. Our results validate the predictive power of the computational model from Section 2 and the data generative model from Definition 4 in real-world serving scenarios. |
| Researcher Affiliation | Industry | Gregory Dexter Linked In EMAIL Shao Tang Linked In EMAIL Ata Fatahibaarzi Linked In EMAIL Qingquan Song Linked In EMAIL Tejas Dharamsi Linked In EMAIL Aman Gupta Nubank EMAIL |
| Pseudocode | Yes | Algorithm 1 k-LPM 1: Input: Input queue of prompts and arrival times Q = (xi, ti) 2: while true do 3: Process the oldest query, i.e., xi such that i = argminj tj 4: for i = 1, . . . , k 1 do 5: Process x that maximizes the prefix cache hit rate 6: end for 7: end while |
| Open Source Code | No | Answer: [No] Justification: The algorithm can easily be reproduced as described above. We are not permitted to share the exact data we used. However, we ve described the relevant properties of the data and so the results should be reproducible by using other similar synthetic or real data sets. |
| Open Datasets | No | Answer: [No] Justification: The algorithm can easily be reproduced as described above. We are not permitted to share the exact data we used. However, we ve described the relevant properties of the data and so the results should be reproducible by using other similar synthetic or real data sets. |
| Dataset Splits | Yes | Finally, we construct the dataset used for benchmarking by sampling four prompts with shared user history from the 8k context length prompts described in Team [2025] for 2100 prompts in total. We then randomly shuffle the ordering of these prompts and use 100 for warm up of the benchmarking server. |
| Hardware Specification | Yes | In our experiments, we use the Llama-3.1-8B-Instruct model Dubey et al. [2024] with tensor parallelism across eight A100 GPUs. (...) For this experiment, we use a single H100 GPU and the Meta-Llama-3.2-1BInstruct model Dubey et al. [2024]. |
| Software Dependencies | Yes | We run the experiment using the SGLang v0.4.1 serving framework. (...) We use the Llama-3.1-8B-Instruct model Dubey et al. [2024] (...) We implement the k-LPM algorithm as an extension to the current LPM implementation in SGLang. |
| Experiment Setup | Yes | In particular, we evaluate the timing metrics using SGLang s serving benchmark utility SGLang [2024] and only modify the benchmarking dataset. We implement the k-LPM algorithm as an extension to the current LPM implementation in SGLang. Finally, we construct the dataset used for benchmarking by sampling four prompts with shared user history from the 8k context length prompts described in Team [2025] for 2100 prompts in total. We then randomly shuffle the ordering of these prompts and use 100 for warm up of the benchmarking server. (...) We measure P99 TTFT vs. request rate for varying values of k in the k-LPM algorithm. Note that SGLang s benchmarking utility uses a Poisson arrival process, so request rate refers to the average number of requests per second. |