Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows

Authors: Zaifeng Pan, Ajjkumar Dahyalal Patel, Yipeng Shen, Zhengding Hu, Yue Guan, Wan-Lu Li, Lianhui Qin, Yida Wang, Yufei Ding

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate KVFlow across a range of microbenchmarks to understand its performance under different caching and execution conditions. Our experiments aim to answer the following key questions: (1) Can KVFlow reduce end-to-end latency for individual workflows with large prompt prefixes and limited GPU memory? (2) How does KVFlow perform under high concurrency, where multiple workflows run in parallel? To answer these questions, we first analyze single-workflow latency in Section 4.1, and then study multi-workflow execution in Section 4.2.
Researcher Affiliation | Collaboration | Zaifeng Pan¹, Ajjkumar Patel¹, Yipeng Shen¹, Zhengding Hu¹, Yue Guan¹, Wan-Lu Li¹, Lianhui Qin¹, Yida Wang², Yufei Ding¹; ¹UCSD, ²AWS
Pseudocode | Yes | Pseudocode. To more clearly illustrate how KVFlow integrates with KV cache management, we include pseudocode for both eviction priority assignment and the eviction procedure. When a new agent request arrives with its Agent Step Graph (ASG) information, KVFlow updates the eviction priority of each cache node following Algorithm 1. When the system needs to evict cache nodes on the GPU to free memory (e.g., during prefill, decode, or prefetch), it proceeds from the leaf nodes in the cache tree based on their priorities, as shown in Algorithm 2.
Open Source Code | No | Answer: [No] Justification: We will open source our codes at https://github.com/PanZaifeng/KVFlow.
Open Datasets | Yes | To better reflect real-world deployment scenarios, we simulate agentic workflows based on the PEER [5] framework. ... We use the Financial QA dataset from PEER as the workflow input.
Dataset Splits | No | We generate synthetic input prompts by randomly sampling token sequences with controlled lengths for both parts. We evaluate two variants: (a) fully deterministic sequential workflows where each stage only has one agent, i.e., branches=1; and (b) moderately dynamic workflows, where each stage randomly selects one of two agents with partially shared prefixes, i.e., branches=2. ... We then execute the 10-stage workflow ten times, each with a varying dynamic suffix, to obtain the end-to-end latency for these ten runs.
Hardware Specification | Yes | We conduct experiments on Qwen2.5-32B on an NVIDIA H100 GPU with 80GB memory and 64 GB/s PCIe Gen5 bandwidth. Qwen uses 40 attention heads and 8 KV heads.
Software Dependencies | Yes | We implement the prototype of KVFlow based on SGLang v0.4.4 [12], an efficient LLM serving system that provides both a backend for LLM execution and a frontend interface for application development.
Experiment Setup | Yes | We adopt deterministic decoding (temperature = 0, greedy sampling) to ensure consistent latency measurements. This setting represents scenarios with tight GPU memory constraints when long fixed prefixes contend for cache space.
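
As a rough illustration of why the figures in the Hardware Specification entry matter for prefix caching, the sketch below estimates how much KV cache one token occupies and how long it would take to move a cached prefix over PCIe. Only the 8 KV heads and the 64 GB/s bandwidth come from the quoted text. The layer count (64), head dimension (128), 2-byte element size, and the 8192-token prefix length are assumptions made for this example, and the function names are hypothetical; none of this is taken from the KVFlow implementation.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of KV cache that one token occupies across all layers.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def transfer_seconds(num_tokens: int, bytes_per_token: int,
                     bandwidth_bytes_per_s: float) -> float:
    """Idealized time to copy a cached prefix at the given bandwidth."""
    return num_tokens * bytes_per_token / bandwidth_bytes_per_s


# Values quoted in the Hardware Specification entry.
NUM_KV_HEADS = 8
PCIE_BYTES_PER_S = 64e9  # 64 GB/s

# Assumptions for illustration only (not stated in the quoted text).
NUM_LAYERS = 64        # assumed layer count for Qwen2.5-32B
HEAD_DIM = 128         # assumed per-head dimension
BYTES_PER_ELEM = 2     # assumed 16-bit KV cache entries
PREFIX_TOKENS = 8192   # hypothetical fixed-prefix length

per_token = kv_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, BYTES_PER_ELEM)
seconds = transfer_seconds(PREFIX_TOKENS, per_token, PCIE_BYTES_PER_S)

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"Prefix of {PREFIX_TOKENS} tokens: {PREFIX_TOKENS * per_token / 2**30:.1f} GiB")
print(f"Idealized PCIe transfer time: {seconds * 1e3:.1f} ms")
```

Under these assumptions each token takes 256 KiB, so an 8192-token prefix is 2 GiB and needs roughly 34 ms to cross PCIe at peak bandwidth. Real transfers would likely be slower, since the estimate ignores transfer overheads and contention.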