MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving
Authors: Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, Hao Zhang
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluation results show that MuxServe can achieve up to 1.8× higher throughput or process 2.9× more requests within 99% SLO attainment. The code is available at: https://github.com/hao-ai-lab/MuxServe. |
| Researcher Affiliation | Academia | 1The Chinese University of Hong Kong 2Shanghai AI Laboratory 3Huazhong University of Science and Technology 4Shanghai Jiao Tong University 5Peking University 6UC Berkeley 7University of California San Diego. |
| Pseudocode | Yes | Algorithm 1 Enumeration-based Greedy LLM Placement; Algorithm 2 LLM Parallel Candidate Generation; Algorithm 3 Adaptive Batch Scheduling (ADBS) |
| Open Source Code | Yes | The code is available at: https://github.com/hao-ai-lab/MuxServe. |
| Open Datasets | Yes | The requests are sampled from ShareGPT (ShareGPT-Team, 2023). |
| Dataset Splits | No | The paper describes generating synthetic workloads and sampling from real traces (ShareGPT, ChatLMSYS trace) but does not provide specific training, validation, or test dataset splits for the experiments. |
| Hardware Specification | Yes | We conduct experiments on a 4-node cluster, each node equipped with 8 NVIDIA A100 (80GB) GPUs. |
| Software Dependencies | No | MuxServe is built atop vLLM (Kwon et al., 2023), an efficient single-LLM serving system based on PyTorch (Paszke et al., 2019), and utilizes NVIDIA MPS (NVIDIA, 2022b) to partition SM resources. While these software components are named, specific version numbers for vLLM, PyTorch, or NVIDIA MPS are not provided, which hinders reproducibility. |
| Experiment Setup | Yes | For synthetic workloads, we first generate request rates for each LLM using a power-law distribution with an exponent α, then sample the arrival time of each request with Poisson processes. The requests are sampled from ShareGPT. We vary α and the rate scale to evaluate diverse workloads. For each α, we first set the maximal request rate for each LLM to 20 req/s, and then scale up the max rate and average rate for evaluation. |
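
The synthetic-workload recipe quoted in the experiment setup row can be sketched as follows. This is a minimal illustration rather than MuxServe's actual generator: the function name, the `duration` and `seed` parameters, and the oversampling buffer are assumptions; only the power-law rates with exponent α, the 20 req/s maximum rate, and the Poisson arrival process come from the paper (the request contents themselves would be drawn from ShareGPT, which is omitted here).

```python
import numpy as np

def generate_synthetic_workload(num_llms: int, alpha: float, max_rate: float = 20.0,
                                duration: float = 60.0, seed: int = 0):
    """Sketch of the paper's synthetic-workload generation (hypothetical helper).

    Request rates follow a power law across LLMs (rate_i proportional to i^-alpha),
    rescaled so the busiest LLM receives `max_rate` req/s; each LLM's arrival
    times are then drawn from a Poisson process via exponential inter-arrival gaps.
    """
    rng = np.random.default_rng(seed)

    # Power-law rates over LLM indices 1..num_llms, scaled to the target max rate.
    raw = np.arange(1, num_llms + 1, dtype=float) ** (-alpha)
    rates = raw / raw.max() * max_rate

    workload = {}
    for llm_id, rate in enumerate(rates):
        # Poisson process: cumulative sum of exponential inter-arrival times.
        # Oversample by 1.5x the expected count so we almost surely cover `duration`.
        n_draws = int(rate * duration * 1.5) + 1
        gaps = rng.exponential(1.0 / rate, size=n_draws)
        arrivals = np.cumsum(gaps)
        workload[f"llm-{llm_id}"] = arrivals[arrivals < duration]
    return rates, workload
```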
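
The SM partitioning noted in the software dependencies row can likewise be illustrated with a minimal sketch. The helper name, the `serve_llm.py` command, and the 70/30 split below are hypothetical and do not reflect MuxServe's actual launch code; the sketch only shows the standard NVIDIA MPS mechanism, where a client process launched with `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` set is limited to that share of the GPU's SMs (assuming the MPS control daemon is already running on the node).

```python
import os
import subprocess

def launch_with_sm_share(cmd: list[str], sm_percentage: int) -> subprocess.Popen:
    """Launch one serving process with a capped share of GPU SMs under NVIDIA MPS.

    Assumes the MPS control daemon (nvidia-cuda-mps-control) is already running;
    CUDA_MPS_ACTIVE_THREAD_PERCENTAGE caps the fraction of SMs this client may
    occupy, which is one way to partition SM resources between colocated
    LLM serving processes.
    """
    env = os.environ.copy()
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(sm_percentage)
    return subprocess.Popen(cmd, env=env)

# Hypothetical usage: give one colocated worker 70% of SMs and another 30%.
# worker_a = launch_with_sm_share(["python", "serve_llm.py", "--llm", "llama-13b"], 70)
# worker_b = launch_with_sm_share(["python", "serve_llm.py", "--llm", "llama-7b"], 30)
```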