MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving

Authors: Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, Hao Zhang

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Evaluation results show that MuxServe achieves up to 1.8× higher throughput or processes 2.9× more requests within 99% SLO attainment. The code is available at: https://github.com/hao-ai-lab/MuxServe.
Researcher Affiliation | Academia | The Chinese University of Hong Kong; Shanghai AI Laboratory; Huazhong University of Science and Technology; Shanghai Jiao Tong University; Peking University; UC Berkeley; University of California San Diego.
Pseudocode | Yes | Algorithm 1: Enumeration-based Greedy LLM Placement; Algorithm 2: LLM Parallel Candidate Generation; Algorithm 3: Adaptive Batch Scheduling (ADBS).
Open Source Code | Yes | The code is available at: https://github.com/hao-ai-lab/MuxServe.
Open Datasets | Yes | The requests are sampled from ShareGPT (ShareGPT-Team, 2023).
Dataset Splits | No | The paper describes the generation of synthetic workloads and sampling from real traces (ShareGPT, ChatLMSYS trace), but does not provide specific training, validation, or test dataset splits for the experiments.
Hardware Specification | Yes | We conduct experiments on a 4-node cluster, where each node is equipped with 8 NVIDIA A100 (80GB) GPUs.
Software Dependencies | No | MuxServe is built atop vLLM (Kwon et al., 2023), an efficient single-LLM serving system based on PyTorch (Paszke et al., 2019), and utilizes NVIDIA MPS (NVIDIA, 2022b) to partition SM resources. While these software components are mentioned, specific version numbers for PyTorch or NVIDIA MPS are not provided, which is required for reproducibility.
Experiment Setup | Yes | For synthetic workloads, we first generate request rates for each LLM using a power-law distribution with exponent α, then sample the arrival time of each request with Poisson processes. The requests are sampled from ShareGPT. We vary α and the rate scales to evaluate diverse workloads. For each α, we first set the maximal request rate for each LLM to 20 req/s, and then scale up the max rate and average rate for evaluation.
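
The synthetic workload generation quoted above can be sketched in a few lines of Python. This is a minimal illustration, assuming a rank-based power law over LLM popularity and homogeneous Poisson arrivals; the function and parameter names (generate_workload, num_llms, duration, seed) are hypothetical and not taken from the MuxServe repository.

import numpy as np

def generate_workload(num_llms=8, alpha=1.0, max_rate=20.0, duration=60.0, seed=0):
    # Hypothetical sketch of the described setup, not the authors' code.
    rng = np.random.default_rng(seed)

    # Power-law request rates over LLM ranks: rate_i ∝ i^(-alpha),
    # rescaled so the most popular LLM receives `max_rate` req/s.
    ranks = np.arange(1, num_llms + 1, dtype=float)
    rates = ranks ** (-alpha)
    rates = rates / rates.max() * max_rate

    # Poisson arrival process per LLM: exponential inter-arrival times
    # with mean 1/rate, accumulated until the trace duration is reached.
    workload = {}
    for llm_id, rate in enumerate(rates):
        arrivals, t = [], 0.0
        while True:
            t += rng.exponential(1.0 / rate)
            if t >= duration:
                break
            arrivals.append(t)
        workload[f"llm-{llm_id}"] = {"rate": rate, "arrival_times": arrivals}
    return workload

if __name__ == "__main__":
    wl = generate_workload(alpha=1.0, max_rate=20.0)
    for name, spec in wl.items():
        print(name, f"rate={spec['rate']:.2f} req/s", f"requests={len(spec['arrival_times'])}")

In the paper's setup the request contents (prompt and output lengths) would then be drawn from ShareGPT samples and scaled as described; the sketch only covers the rate and arrival-time generation.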