HexGen: Generative Inference of Large Language Model over Heterogeneous Environment
Authors: Youhe Jiang, Ran Yan, Xiaozhe Yao, Yang Zhou, Beidi Chen, Binhang Yuan
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct an extensive evaluation to verify the efficiency of HEXGEN by serving the state-of-the-art LLAMA2 (70B) model. The results suggest that HEXGEN can choose to achieve up to 2.3× lower latency deadlines or tolerate up to 4× more request rates compared with the homogeneous baseline given the same budget. |
| Researcher Affiliation | Academia | 1) Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong, China; 2) Department of Computer Science, ETH Zurich, Zürich, Switzerland; 3) Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania. |
| Pseudocode | Yes | Algorithm 1 Estimate optimal pipeline cost. |
| Open Source Code | Yes | Our implementation is available at https://github.com/Relaxed-System-Lab/HexGen. |
| Open Datasets | Yes | We apply the most popular open-source LLAMA-2 (70B) model on some real-world prompts (Lmsys, 2023) |
| Dataset Splits | No | The paper does not provide explicit details about data splits for training, validation, or testing, such as percentages or sample counts. It focuses on inference performance of pre-trained models. |
| Hardware Specification | Yes | We rent two AWS on-demand p4d.24xlarge instances, each equipped with 8 NVIDIA A100-40G GPUs... we rent two 3090Ti×8 instances in Iceland, two 3090Ti×3 instances in Norway, one A5000×8 instance in Nevada, and two A6000×8 instances, one A5000×8 instance, and one A40×4 instance in Illinois. |
| Software Dependencies | No | The paper mentions software like the "FlashAttention (Dao, 2023) framework" and "libP2P (LibP2P, 2023)" but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | We apply the most popular open-source LLAMA-2 (70B) model on some real-world prompts (Lmsys, 2023), and test output sequence lengths from 32 to 128, and request rates varying between 0.125 and 10 requests per second. |
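The table's pseudocode row names "Algorithm 1: Estimate optimal pipeline cost." The sketch below is a hypothetical illustration of the general idea behind such a cost estimate, not HexGen's actual algorithm (the real model is in the paper and repository): end-to-end latency is the sum of per-stage compute plus inter-stage communication, while steady-state throughput is limited by the slowest stage. The function name and the example timings are assumptions for illustration only.

```python
# Hypothetical sketch (not HexGen's Algorithm 1): a generic pipeline-cost
# estimate for a model partitioned across heterogeneous GPUs.

def estimate_pipeline_cost(stage_compute_ms, comm_ms):
    """Return (end_to_end_latency_ms, bottleneck_stage_ms) for one request.

    stage_compute_ms: per-stage compute time on the assigned GPUs (ms).
    comm_ms: communication time between consecutive stages (len = stages - 1).
    """
    assert len(comm_ms) == len(stage_compute_ms) - 1
    # Latency of a single request traversing the whole pipeline.
    latency = sum(stage_compute_ms) + sum(comm_ms)
    # The slowest stage bounds how fast requests can stream through.
    bottleneck = max(stage_compute_ms)
    return latency, bottleneck

# Example: 3 heterogeneous stages (e.g. an A100 partition vs. 3090Ti partitions)
# with made-up timings.
lat, bn = estimate_pipeline_cost([120.0, 200.0, 150.0], [15.0, 30.0])
# lat = 515.0 ms end-to-end; bn = 200.0 ms bottleneck stage
```

A search over candidate partitions would evaluate this cost for each assignment of model layers to GPU groups and keep the cheapest one, which is the role Algorithm 1 plays in the paper's scheduling framework.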