Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location
Authors: Ting Sun, Penghan Wang, Fan Lai
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation on production workloads shows that HyGen achieves up to 3.9-5.8× throughput gains over online and hybrid serving baselines, while ensuring latency SLOs. The paper includes a dedicated '5 Performance Evaluation' section with subsections on '5.1 Evaluation Setup', '5.2 End-to-end Performance', '5.3 Performance Breakdown', and '5.4 Ablation Studies', presenting empirical results and comparisons. |
| Researcher Affiliation | Academia | Ting Sun¹, Penghan Wang², Fan Lai¹; ¹ Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign; ² Department of Computer Science, Purdue University; EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: HyGen SLO-aware scheduler; Algorithm 2: HyGen two-phase scheduler; Algorithm 3: Prefix-sharing-aware offline scheduler; Algorithm 4: Prefix-sharing-aware offline scheduler |
| Open Source Code | Yes | The code of HyGen is publicly available at https://github.com/UIUC-MLSys/HyGen. |
| Open Datasets | Yes | Datasets. We list the license of used datasets as follows: arXiv summarization dataset [11]: Apache-2.0 License; Azure LLM Inference trace [56]: CC-BY-4.0; MMLU dataset [20, 21]: MIT License. |
| Dataset Splits | No | The paper describes how workloads were generated: 'Online workloads are based on the conversation trace from Azure LLM inference trace 2023 [41], a one-hour production trace with real-world requests and timestamps. We randomly sampled the trace to achieve the desired QPS that suits our hardware serving capacity. Specifically, within a time duration of T seconds, we would sample T × Q requests to suit a desired QPS Q.' This explains how the workload data was prepared for the serving system evaluation, but it does not specify traditional training/validation/test splits, because the paper evaluates a serving system rather than training new models. |
| Hardware Specification | Yes | We evaluate HyGen on three server configurations: one with 4 NVIDIA A100 GPUs (40GB VRAM each), one with 4 NVIDIA A40 GPUs (48GB VRAM each), and one with 1 NVIDIA A5000 GPU (24GB VRAM). All servers have 64 CPU cores, 256GB DDR4 RAM, and a 1.5TB NVMe SSD. |
| Software Dependencies | No | We implement HyGen on top of vLLM [29, 57] and Sarathi [2, 46], with 1,300 lines of additional code. For end-to-end evaluation, we use Llama2-7B [56] and Qwen-14B [6] models. The paper mentions the software and models used but does not provide specific version numbers for these components. |
| Experiment Setup | Yes | We evaluate HyGen under four SLO metrics (mean TBT, P99 TBT, mean TTFT, and P99 TTFT) with varied interference tolerance ratios. For pure offline serving, we use Sarathi-offline to evaluate the maximum offline serving capacity, where an optimal chunk size is profiled for offline workload to maximize throughput. The hyperparameter search of Sarathi-offline achieves 12% throughput gain compared to the default setup, ensuring optimal baseline performance for fair comparison. HyGen* serves offline requests at a specific offline QPS to control overall interference. The offline QPS is profiled using a similar design with the HyGen profiler to guarantee bounded SLO interference. |
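
The Dataset Splits row quotes the paper's workload preparation: within a window of T seconds, T × Q requests are randomly sampled from the trace to reach a desired QPS Q. The following is a minimal sketch of that idea, not the authors' implementation. The function name `sample_trace_to_qps`, the `(timestamp_seconds, payload)` record layout, and the choices to sample without replacement and to keep the original arrival order are all assumptions made for illustration; the quoted text does not specify them.

```python
import random


def sample_trace_to_qps(trace, duration_s, target_qps, seed=0):
    """Randomly keep about duration_s * target_qps requests from a trace window.

    trace: list of (timestamp_seconds, payload) tuples, with timestamps
           measured from the start of the trace (hypothetical layout).
    """
    # Restrict to requests that arrive inside the first duration_s seconds.
    window = [record for record in trace if 0 <= record[0] < duration_s]

    # T * Q requests give an average rate of Q requests per second over T seconds.
    n_wanted = int(round(duration_s * target_qps))
    if n_wanted > len(window):
        raise ValueError(
            f"trace window has {len(window)} requests, cannot sample {n_wanted}"
        )

    # Assumption: sample without replacement, with a fixed seed for repeatability.
    rng = random.Random(seed)
    sampled = rng.sample(window, n_wanted)

    # Assumption: replay the sampled requests in their original arrival order.
    sampled.sort(key=lambda record: record[0])
    return sampled


if __name__ == "__main__":
    # Synthetic trace: 1000 requests spaced 0.1 s apart (10 QPS over 100 s).
    demo_trace = [(i * 0.1, f"req-{i}") for i in range(1000)]
    subset = sample_trace_to_qps(demo_trace, duration_s=100, target_qps=2)
    print(len(subset) / 100, "requests per second on average")
```

Keeping the original timestamps preserves the burstiness of the production trace while thinning it to the target rate; a replay that needs evenly spaced arrivals would have to reassign timestamps instead.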