Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Transcending Cost-Quality Tradeoff in Agent Serving via Session-Awareness
Authors: Yanyu Ren, Li Chen, Dan Li, Xizheng Wang, Zhiyuan Wu, Yukai Miao, Yu Bai
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on real testbeds demonstrate that AGSERVE (1) achieves comparable response quality to GPT-4o at a 16.5% cost, and (2) delivers 1.8× improvement in quality relative to the tradeoff curve. |
| Researcher Affiliation | Academia | ¹Tsinghua University; ²Zhongguancun Laboratory |
| Pseudocode | Yes | Algorithm 1: ECE Eviction Policy. Input: number of ongoing sessions n, number of running sessions r, number of waiting sessions w, session cache size (can be zero) l[1..n] ranked by ETA, maximum batch size K, prefilling per-block overhead P, decoding time consumption D. Output: available session nums k, sessions to evict E. |
| Open Source Code | Yes | Our work is available at https://github.com/robinren03/agserve. |
| Open Datasets | Yes | We evaluate the performance of AGSERVE based on AgentBench [29] over two testbeds. [...] We train the judge on a customized Chatbot-Arena dataset [8]. [...] The ALFWorld (AW) [45] is an embodied household assistant navigating and interacting within simulated environments. |
| Dataset Splits | Yes | We train the Q-Judge for 10 epochs on top of BERT [22] with a batch size of 16 and a warm-up step of 500, utilizing one A6000 GPU. [...] The distribution of the Q-Judge train set and the results on the evaluation set are shown in Table 5. [...] We train the R-Judge for 10 epochs with a weight decay of 1e-2 and warm-up steps of 500. [...] The distribution of the R-Judge train set and the result on the evaluation set are shown in Table 6 with GT representing ground truth. |
| Hardware Specification | Yes | The first consists of two nodes, each equipped with four A6000 GPUs (48GB per GPU). The second comprises two nodes, each equipped with eight A800 GPUs (80GB per GPU), interconnected via PCIe. |
| Software Dependencies | No | We implement AGSERVE in Python with SAS based on vLLM [24]. We also implement customized CUDA kernels to support batched in-place KV cache calibration. |
| Experiment Setup | Yes | We train the Q-Judge for 10 epochs on top of BERT [22] with a batch size of 16 and a warm-up step of 500, utilizing one A6000 GPU. [...] We set the quality control threshold θ of RJ_θ(·) to 0.5. The reasoning quality check frequency ν is set to 4 for all agents except M2W. AGSERVE checks the LLM's reasoning quality for every response (ν = 1), as an M2W session usually finishes in 2 to 3 rounds. [...] We set the max_memory_utilization of SAS to 0.37 (equivalent to 17.8GB, 3GB of CUDA graph not included) and the max_model_len to 4096 on one A6000 GPU. The temperature is set to 1, following the OpenAI API default. |
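
The memory figure quoted in the Experiment Setup row can be checked with simple arithmetic, and the check frequency ν can be read as a per-round schedule. The sketch below is illustrative only: the helper names `memory_budget_gb` and `should_check_quality` are hypothetical and do not come from the paper or its repository, and the "check on every ν-th round" rule is an assumed reading of the quoted text. Only the numeric values (48GB per A6000, 0.37 utilization, ν = 4 or 1) come from the table.

```python
# Numeric values are taken from the table above; all names are hypothetical.
A6000_MEMORY_GB = 48.0          # per-GPU memory quoted in the hardware row
MAX_MEMORY_UTILIZATION = 0.37   # value quoted for SAS on one A6000


def memory_budget_gb(total_gb: float, utilization: float) -> float:
    """Return the memory budget implied by a utilization fraction."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return total_gb * utilization


def should_check_quality(round_index: int, nu: int) -> bool:
    """Assumed schedule: run the quality check on every nu-th round (1-indexed)."""
    if nu < 1:
        raise ValueError("nu must be a positive integer")
    return round_index % nu == 0


budget = memory_budget_gb(A6000_MEMORY_GB, MAX_MEMORY_UTILIZATION)
print(f"{budget:.2f} GB")  # 17.76 GB, which rounds to the quoted 17.8GB

# With nu = 4, rounds 4 and 8 are checked; with nu = 1, every round is checked.
print([r for r in range(1, 9) if should_check_quality(r, 4)])
```

Under this reading, ν = 4 would never trigger in a session that ends after 2 to 3 rounds, which is consistent with the quoted rationale for using ν = 1 on M2W.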