HexGen: Generative Inference of Large Language Model over Heterogeneous Environment
Authors: Youhe Jiang, Ran Yan, Xiaozhe Yao, Yang Zhou, Beidi Chen, Binhang Yuan
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct an extensive evaluation to verify the efficiency of HEXGEN by serving the state-of-the-art LLAMA2 (70B) model. The results suggest that HEXGEN can choose to achieve up to 2.3× lower latency deadlines or tolerate up to 4× more request rates compared with the homogeneous baseline given the same budget. |
| Researcher Affiliation | Academia | 1) Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong, China; 2) Department of Computer Science, ETH Zurich, Zürich, Switzerland; 3) Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania. |
| Pseudocode | Yes | Algorithm 1 Estimate optimal pipeline cost. |
| Open Source Code | Yes | Our implementation is available at https://github.com/Relaxed-System-Lab/HexGen. |
| Open Datasets | Yes | We apply the most popular open-source LLAMA-2 (70B) model on some real-world prompts (Lmsys, 2023) |
| Dataset Splits | No | The paper does not provide explicit details about data splits for training, validation, or testing, such as percentages or sample counts. It focuses on inference performance of pre-trained models. |
| Hardware Specification | Yes | We rent two AWS on-demand p4d.24xlarge instances, each equipped with 8 NVIDIA A100-40G GPUs... we rent two 3090Ti×8 instances in Iceland, two 3090Ti×3 instances in Norway, one A5000×8 instance in Nevada, and two A6000×8 instances, one A5000×8 instance, and one A40×4 instance in Illinois. |
| Software Dependencies | No | The paper mentions software like the "FlashAttention (Dao, 2023) framework" and "libP2P (LibP2P, 2023)" but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | We apply the most popular open-source LLAMA-2 (70B) model on some real-world prompts (Lmsys, 2023), and test output sequence lengths from 32 to 128, and request rates varying between 0.125 and 10 requests per second. |
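The table's pseudocode row names "Algorithm 1: Estimate optimal pipeline cost." The sketch below is a hypothetical illustration of the general idea behind such a cost estimate, not HexGen's actual algorithm (the real model is in the paper and repository): end-to-end latency is the sum of per-stage compute plus inter-stage communication, while steady-state throughput is limited by the slowest stage. The function name and the example timings are assumptions for illustration only.

```python
# Hypothetical sketch (not HexGen's Algorithm 1): a generic pipeline-cost
# estimate for a model partitioned across heterogeneous GPUs.

def estimate_pipeline_cost(stage_compute_ms, comm_ms):
    """Return (end_to_end_latency_ms, bottleneck_stage_ms) for one request.

    stage_compute_ms: per-stage compute time on the assigned GPUs (ms).
    comm_ms: communication time between consecutive stages (len = stages - 1).
    """
    assert len(comm_ms) == len(stage_compute_ms) - 1
    # Latency of a single request traversing the whole pipeline.
    latency = sum(stage_compute_ms) + sum(comm_ms)
    # The slowest stage bounds how fast requests can stream through.
    bottleneck = max(stage_compute_ms)
    return latency, bottleneck

# Example: 3 heterogeneous stages (e.g. an A100 partition vs. 3090Ti partitions)
# with made-up timings.
lat, bn = estimate_pipeline_cost([120.0, 200.0, 150.0], [15.0, 30.0])
# lat = 515.0 ms end-to-end; bn = 200.0 ms bottleneck stage
```

A search over candidate partitions would evaluate this cost for each assignment of model layers to GPU groups and keep the cheapest one, which is the role Algorithm 1 plays in the paper's scheduling framework.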