Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
HexGen: Generative Inference of Large Language Model over Heterogeneous Environment
Authors: Youhe Jiang, Ran Yan, Xiaozhe Yao, Yang Zhou, Beidi Chen, Binhang Yuan
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct an extensive evaluation to verify the efficiency of HEXGEN by serving the state-of-the-art LLAMA-2 (70B) model. The results suggest that HEXGEN can choose to achieve up to 2.3× lower latency deadlines or tolerate up to 4× more request rates compared with the homogeneous baseline given the same budget. |
| Researcher Affiliation | Academia | ¹Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong, China; ²Department of Computer Science, ETH Zurich, Zürich, Switzerland; ³Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania. |
| Pseudocode | Yes | Algorithm 1 Estimate optimal pipeline cost. |
| Open Source Code | Yes | Our implementation is available at https://github.com/Relaxed-System-Lab/HexGen. |
| Open Datasets | Yes | We apply the most popular open-source LLAMA-2 (70B) model on some real-world prompts (Lmsys, 2023) |
| Dataset Splits | No | The paper does not provide explicit details about data splits for training, validation, or testing, such as percentages or sample counts. It focuses on inference performance of pre-trained models. |
| Hardware Specification | Yes | We rent two AWS on-demand p4d.24xlarge instances, each equipped with 8 NVIDIA A100-40G GPUs... we rent two 3090Ti×8 instances in Iceland, two 3090Ti×3 instances in Norway, one A5000×8 in Nevada, two A6000×8 instances, one A5000×8 instances and one A40×4 instances in Illinois. |
| Software Dependencies | No | The paper mentions software such as the "Flash Attention (Dao, 2023) framework" and "libP2P (LibP2P, 2023)" but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | We apply the most popular open-source LLAMA-2 (70B) model on some real-world prompts (Lmsys, 2023), and test output sequence lengths from 32 to 128, and request rates varying between 0.125 and 10 requests per second. |