Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Demystifying Cost-Efficiency in LLM Serving over Heterogeneous GPUs
Authors: Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Guoliang He, Xupeng Miao, Ana Klimovic, Bin Cui, Binhang Yuan, Eiko Yoneki
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The first contribution is a comprehensive benchmarking of LLM serving over various GPU types, which offers a detailed understanding of cost-efficiency with heterogeneous GPU resources. ... We empirically evaluate our framework by comparing it with both homogeneous and heterogeneous baselines across a variety of scenarios, covering diverse workload traces, varying GPU availabilities, and multi-model serving. The results demonstrate that, within the same price budget, our approach can achieve up to 41% and on average 20% higher throughput, or reduce the serving latency by up to 54% and on average 20%. |
| Researcher Affiliation | Academia | 1Department of Computer Science, University of Cambridge, Cambridgeshire, UK 2Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong, China 3Department of Computer Science, Peking University, Beijing, China 4Department of Computer Science, ETH Zurich, Zürich, Switzerland 5Department of Computer Science, Purdue University, West Lafayette, Indiana, US. Correspondence to: Binhang Yuan <EMAIL>, Eiko Yoneki <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Binary Search on T |
| Open Source Code | No | The paper does not provide a specific link to a code repository, an explicit statement about releasing the code for this work, or mention code in supplementary materials. It mentions using 'vLLM (Kwon et al., 2023)' for experiments, which is a third-party tool, but not its own implementation code. |
| Open Datasets | Yes | We subsample nine workload types from the ShareGPT (Zheng et al.), WildGPT (Zhao et al.), and Azure-Trace datasets (Patel et al., 2024). ... Azure. Azure public dataset, 2024. URL https://github.com/Azure/AzurePublicDataset. ... Zheng, L., Chiang, W.-L., Sheng, Y., Li, T., Zhuang, S., Wu, Z., Zhuang, Y., Li, Z., Lin, Z., Xing, E., et al. LMSYS-Chat-1M: A large-scale real-world LLM conversation dataset. In The Twelfth International Conference on Learning Representations. ... Zhao, W., Ren, X., Hessel, J., Cardie, C., Choi, Y., and Deng, Y. WildChat: 1M ChatGPT interaction logs in the wild. In The Twelfth International Conference on Learning Representations. |
| Dataset Splits | Yes | Our testing traces are subsampled from three sources: real workload traces collected over one month from the Swiss AI Center, the WildChat dataset, and the production traces Azure-Trace. Each trace comprises multiple workload types, with their ratios shown in Table 5 in Appendix I. ... Table 5: Workload type ratios for subsampled traces from the Swiss AI Center (Trace 1), Azure-Trace (Trace 2), and WildGPT dataset (Trace 3). Workloads 1-9 correspond to the nine workload types shown in Figure 4 from left to right. |
| Hardware Specification | Yes | Our experiments are conducted using two types of data center servers (H100 and A100), three types of workstation servers (A40, RTX A6000, and L40), and one type of consumer server (RTX 4090). In data center servers, GPUs are linked by NVLink (300 GB/s), while in workstation/consumer servers, GPUs are linked by PCIe (60 GB/s). Servers are interconnected via Ethernet with a bandwidth of 5 Gb/s. |
| Software Dependencies | No | All experiments are conducted with vLLM (Kwon et al., 2023). The paper mentions vLLM as the software used but does not provide a specific version number for it. |
| Experiment Setup | Yes | Benchmark settings. We subsample nine workload types from the ShareGPT (Zheng et al.), WildGPT (Zhao et al.), and Azure-Trace datasets (Patel et al., 2024). ... We evaluate two models, Llama3-8B and Llama3-70B, on six commonly used cloud GPUs (A6000, A40, L40, A100, H100, and 4090) with different deployment configurations. ... Figure 4 presents the benchmark results of various deployment configurations across different models, workloads, and GPU types. The three-element array represents the DP, TP, and PP degrees. |
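The Pseudocode row above cites the paper's "Algorithm 1: Binary Search on T". A minimal sketch of such a search, assuming a monotone feasibility predicate over the target value T; the predicate, bounds, and tolerance here are illustrative, not taken from the paper:

```python
# Hypothetical sketch of a binary search on a real-valued target T,
# in the spirit of "Algorithm 1: Binary Search on T".
# Assumes is_feasible(T) is True for all T up to some threshold
# and False above it (monotone feasibility).

def binary_search_T(lo, hi, is_feasible, tol=1e-3):
    """Return the largest T in [lo, hi] for which is_feasible(T) holds,
    to within tolerance tol. Requires is_feasible(lo) and not is_feasible(hi)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if is_feasible(mid):
            lo = mid  # mid is achievable; search higher
        else:
            hi = mid  # mid is not achievable; search lower
    return lo
```

For example, with a feasibility threshold of 7.25, the search converges to a value within `tol` of 7.25.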
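The Experiment Setup row notes that each deployment configuration is reported as a three-element array of DP, TP, and PP degrees. A small sketch of enumerating candidate triples, under the assumption that a valid configuration is any triple whose product equals the GPU count; the function name and enumeration strategy are hypothetical, not the authors' code:

```python
# Illustrative only: enumerate (DP, TP, PP) parallelism-degree triples
# whose product equals a given GPU count, matching the three-element
# arrays reported in the paper's Figure 4.

def parallelism_configs(num_gpus):
    """Return all (dp, tp, pp) triples with dp * tp * pp == num_gpus."""
    configs = []
    for dp in range(1, num_gpus + 1):
        if num_gpus % dp:
            continue
        rest = num_gpus // dp
        for tp in range(1, rest + 1):
            if rest % tp:
                continue
            configs.append((dp, tp, rest // tp))
    return configs
```

For 4 GPUs this yields six candidates, e.g. (1, 2, 2) for pure TP+PP or (4, 1, 1) for pure data parallelism; a real planner would then cost each candidate against the workload.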