Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Transcending Cost-Quality Tradeoff in Agent Serving via Session-Awareness

Authors: Yanyu Ren, Li Chen, Dan Li, Xizheng Wang, Zhiyuan Wu, Yukai Miao, Yu Bai

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on real testbeds demonstrate that AGSERVE (1) achieves comparable response quality to GPT-4o at a 16.5% cost. (2) delivers 1.8× improvement in quality relative to the tradeoff curve.
Researcher Affiliation	Academia	1 Tsinghua University; 2 Zhongguancun Laboratory
Pseudocode	Yes	Algorithm 1: ECE Eviction Policy. Input: Number of ongoing sessions n, number of running sessions r, number of waiting sessions w, session cache size (can be zero) l[1...n] ranked by ETA, maximum batch size K, prefilling per-block overhead P, decoding time consumption D. Output: Available session nums k, sessions to evict E.
Open Source Code	Yes	Our work is available at https://github.com/robinren03/agserve.
Open Datasets	Yes	We evaluate the performance of AGSERVE based on AgentBench [29] over two testbeds. [...] We train the judge on a customized Chatbot-Arena dataset [8]. [...] The ALFWorld (AW) [45] is an embodied household assistant navigating and interacting within simulated environments.
Dataset Splits	Yes	We train the Q-Judge for 10 epochs on top of BERT [22] with a batch size of 16 and a warm-up step of 500, utilizing one A6000 GPU. [...] The distribution of the Q-Judge train set and the results on the evaluation set are shown in Table 5. [...] We train the R-Judge for 10 epochs with a weight decay of 1e-2 and warm-up steps of 500. [...] The distribution of the R-Judge train set and the result on the evaluation set are shown in Table 6 with GT representing ground truth.
Hardware Specification	Yes	The first consists of two nodes, each equipped with four A6000 GPUs (48GB per GPU). The second comprises two nodes, each equipped with eight A800 GPUs (80GB per GPU), interconnected via PCIe.
Software Dependencies	No	We implement AGSERVE in Python with SAS based on vLLM [24]. We also implement customized CUDA kernels to support batched in-place KV cache calibration.
Experiment Setup	Yes	We train the Q-Judge for 10 epochs on top of BERT [22] with a batch size of 16 and a warm-up step of 500, utilizing one A6000 GPU. [...] We set the quality control threshold θ of RJθ(·) to 0.5. The reasoning quality check frequency ν is set to 4 for all agents except M2W. AGSERVE checks the LLM's reasoning quality for every response (ν = 1), as an M2W session usually finishes in 2 to 3 rounds. [...] We set the max_memory_utilization of SAS to 0.37 (equivalent to 17.8GB, 3GB of CUDA graph not included) and the max_model_len to 4096 on one A6000 GPU. The temperature is set to 1, following the OpenAI API default.
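
The values quoted in the Experiment Setup row can be collected into one record, which also makes the stated memory figure easy to check. The sketch below is illustrative only: the class name, field names, and helper functions are hypothetical and are not taken from the AGSERVE code base. It assumes the 48GB per-GPU capacity quoted in the Hardware Specification row and treats "GB" as a plain multiplier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Hyperparameters quoted in the table; field names are hypothetical."""
    q_judge_epochs: int = 10
    q_judge_batch_size: int = 16
    q_judge_warmup_steps: int = 500
    r_judge_epochs: int = 10
    r_judge_weight_decay: float = 1e-2
    r_judge_warmup_steps: int = 500
    quality_threshold: float = 0.5        # theta
    check_frequency: int = 4              # nu, all agents except M2W
    check_frequency_m2w: int = 1          # nu for M2W sessions
    max_memory_utilization: float = 0.37
    max_model_len: int = 4096
    temperature: float = 1.0


def memory_budget_gb(utilization: float, gpu_memory_gb: float) -> float:
    """Memory budget implied by a utilization fraction of one GPU."""
    return utilization * gpu_memory_gb


def check_frequency_for(agent: str, setup: ReportedSetup) -> int:
    """Return nu for an agent: every response for M2W, otherwise the default."""
    return setup.check_frequency_m2w if agent == "M2W" else setup.check_frequency


setup = ReportedSetup()
# 0.37 of a 48GB A6000; rounds to the 17.8GB figure quoted in the table.
budget = round(memory_budget_gb(setup.max_memory_utilization, 48.0), 1)
```

Under these assumptions the computed budget matches the 17.8GB quoted in the row; the 3GB of CUDA graph memory is excluded from that figure, as the excerpt states.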