HexGen: Generative Inference of Large Language Model over Heterogeneous Environment

Authors: Youhe Jiang, Ran Yan, Xiaozhe Yao, Yang Zhou, Beidi Chen, Binhang Yuan

ICML 2024

Reproducibility Variable Result LLM Response
Research Type Experimental We conduct an extensive evaluation to verify the efficiency of HEXGEN by serving the state-of-the-art LLAMA2 (70B) model. The results suggest that HEXGEN can choose to achieve up to 2.3× lower latency deadlines or tolerate up to 4× more request rates compared with the homogeneous baseline given the same budget.
Researcher Affiliation Academia 1Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong, China 2Department of Computer Science, ETH Zurich, Zürich, Switzerland 3Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania.
Pseudocode Yes Algorithm 1 Estimate optimal pipeline cost.
Open Source Code Yes Our implementation is available at https://github.com/Relaxed-System-Lab/HexGen.
Open Datasets Yes We apply the most popular open-source LLAMA-2 (70B) model on some real-world prompts (Lmsys, 2023)
Dataset Splits No The paper does not provide explicit details about data splits for training, validation, or testing, such as percentages or sample counts. It focuses on inference performance of pre-trained models.
Hardware Specification Yes We rent two AWS on-demand p4d.24xlarge instances, each equipped with 8 NVIDIA A100-40G GPUs... we rent two 3090Ti×8 instances in Iceland, two 3090Ti×3 instances in Norway, one A5000×8 instance in Nevada, two A6000×8 instances, one A5000×8 instance and one A40×4 instance in Illinois.
Software Dependencies No The paper mentions software like the "FlashAttention (Dao, 2023) framework" and "libP2P (LibP2P, 2023)" but does not provide specific version numbers for these dependencies.
Experiment Setup Yes We apply the most popular open-source LLAMA-2 (70B) model on some real-world prompts (Lmsys, 2023), and test output sequence lengths from 32 to 128, and request rates varying between 0.125 and 10 requests per second.
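To illustrate the kind of computation Algorithm 1 ("Estimate optimal pipeline cost") performs, the sketch below estimates end-to-end latency for a pipeline whose stages run on heterogeneous GPUs. This is a minimal, hypothetical model assuming a standard fill/drain-plus-bottleneck formulation; the function name, inputs, and cost terms are illustrative assumptions, not HexGen's actual implementation.

```python
def pipeline_cost(stage_compute, comm_cost, num_microbatches):
    """Estimate end-to-end latency of a pipeline over heterogeneous devices.

    stage_compute: per-microbatch compute time of each stage (seconds);
                   stages on slower GPUs have larger entries
    comm_cost: activation-transfer time between consecutive stages
    num_microbatches: microbatches pipelined through the stages
    """
    # Fill/drain term: one microbatch traverses every stage and link once.
    fill_drain = sum(stage_compute) + sum(comm_cost)
    # Steady-state term: throughput is limited by the slowest stage,
    # so the remaining microbatches each add one bottleneck interval.
    bottleneck = max(stage_compute)
    return fill_drain + (num_microbatches - 1) * bottleneck

# Example: 3 heterogeneous stages (e.g., an A100 shard vs. 3090Ti shards)
cost = pipeline_cost([0.4, 1.0, 0.5], [0.1, 0.2], num_microbatches=8)
```

Under this model, a planner would enumerate candidate stage assignments across the available GPUs and keep the one minimizing `pipeline_cost`; the hedged formula above captures only the latency-estimation step.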