MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving
Authors: Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, Hao Zhang
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluation results show that MuxServe can achieve up to 1.8× higher throughput or process 2.9× more requests within 99% SLO attainment. The code is available at: https://github.com/hao-ai-lab/MuxServe. |
| Researcher Affiliation | Academia | 1The Chinese University of Hong Kong 2Shanghai AI Laboratory 3Huazhong University of Science and Technology 4Shanghai Jiao Tong University 5Peking University 6UC Berkeley 7University of California San Diego. |
| Pseudocode | Yes | Algorithm 1 Enumeration-based Greedy LLM Placement; Algorithm 2 LLM Parallel Candidate Generation; Algorithm 3 Adaptive Batch Scheduling (ADBS) |
| Open Source Code | Yes | The code is available at: https://github.com/hao-ai-lab/MuxServe. |
| Open Datasets | Yes | The requests are sampled from ShareGPT (ShareGPT-Team, 2023). |
| Dataset Splits | No | The paper describes generating synthetic workloads and sampling from real traces (ShareGPT, ChatLMSYS trace) but does not provide specific training, validation, or test dataset splits for the experiments. |
| Hardware Specification | Yes | We conduct experiments on a 4-node cluster, each node equipped with 8 NVIDIA A100 (80GB) GPUs. |
| Software Dependencies | No | MuxServe is built atop vLLM (Kwon et al., 2023), an efficient single-LLM serving system based on PyTorch (Paszke et al., 2019), and utilizes NVIDIA MPS (NVIDIA, 2022b) to partition SM resources. While these software components are named, specific version numbers for vLLM, PyTorch, or NVIDIA MPS are not provided, which hinders reproducibility. |
| Experiment Setup | Yes | For synthetic workloads, we first generate request rates for each LLM using a power-law distribution with an exponent α, then sample the arrival time of each request with Poisson processes. The requests are sampled from ShareGPT. We vary α and the rate scale to evaluate diverse workloads. For each α, we first set the maximal request rate for each LLM to 20 req/s, and then scale up the max rate and average rate for evaluation. |
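
The synthetic-workload recipe quoted in the experiment setup row can be sketched as follows. This is a minimal illustration rather than MuxServe's actual generator: the function name, the `duration` and `seed` parameters, and the oversampling buffer are assumptions; only the power-law rates with exponent α, the 20 req/s maximum rate, and the Poisson arrival process come from the paper (the request contents themselves would be drawn from ShareGPT, which is omitted here).

```python
import numpy as np

def generate_synthetic_workload(num_llms: int, alpha: float, max_rate: float = 20.0,
                                duration: float = 60.0, seed: int = 0):
    """Sketch of the paper's synthetic-workload generation (hypothetical helper).

    Request rates follow a power law across LLMs (rate_i proportional to i^-alpha),
    rescaled so the busiest LLM receives `max_rate` req/s; each LLM's arrival
    times are then drawn from a Poisson process via exponential inter-arrival gaps.
    """
    rng = np.random.default_rng(seed)

    # Power-law rates over LLM indices 1..num_llms, scaled to the target max rate.
    raw = np.arange(1, num_llms + 1, dtype=float) ** (-alpha)
    rates = raw / raw.max() * max_rate

    workload = {}
    for llm_id, rate in enumerate(rates):
        # Poisson process: cumulative sum of exponential inter-arrival times.
        # Oversample by 1.5x the expected count so we almost surely cover `duration`.
        n_draws = int(rate * duration * 1.5) + 1
        gaps = rng.exponential(1.0 / rate, size=n_draws)
        arrivals = np.cumsum(gaps)
        workload[f"llm-{llm_id}"] = arrivals[arrivals < duration]
    return rates, workload
```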
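
The SM partitioning noted in the software dependencies row can likewise be illustrated with a minimal sketch. The helper name, the `serve_llm.py` command, and the 70/30 split below are hypothetical and do not reflect MuxServe's actual launch code; the sketch only shows the standard NVIDIA MPS mechanism, where a client process launched with `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` set is limited to that share of the GPU's SMs (assuming the MPS control daemon is already running on the node).

```python
import os
import subprocess

def launch_with_sm_share(cmd: list[str], sm_percentage: int) -> subprocess.Popen:
    """Launch one serving process with a capped share of GPU SMs under NVIDIA MPS.

    Assumes the MPS control daemon (nvidia-cuda-mps-control) is already running;
    CUDA_MPS_ACTIVE_THREAD_PERCENTAGE caps the fraction of SMs this client may
    occupy, which is one way to partition SM resources between colocated
    LLM serving processes.
    """
    env = os.environ.copy()
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(sm_percentage)
    return subprocess.Popen(cmd, env=env)

# Hypothetical usage: give one colocated worker 70% of SMs and another 30%.
# worker_a = launch_with_sm_share(["python", "serve_llm.py", "--llm", "llama-13b"], 70)
# worker_b = launch_with_sm_share(["python", "serve_llm.py", "--llm", "llama-7b"], 30)
```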