FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
Authors: Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present FlexGen, a high-throughput generation engine for running LLMs with limited GPU memory... FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems, reaching a generation throughput of 1 token/s for the first time with an effective batch size of 144. On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours. (A back-of-envelope check of this throughput figure is sketched below the table.) |
| Researcher Affiliation | Collaboration | Stanford University, UC Berkeley, ETH Zurich, Yandex, HSE University, Meta, Carnegie Mellon University. |
| Pseudocode | Yes | Algorithm 1: Block Schedule with Overlapping (a minimal sketch of its loop structure is given below the table) |
| Open Source Code | Yes | The code is available at https://github.com/FMInference/FlexGen. |
| Open Datasets | Yes | We use synthetic datasets where all prompts are padded to the same length. ... We use two tasks to show that our approximation methods exhibit negligible accuracy loss: next-word prediction on LAMBADA (Paperno et al., 2016) and language modeling on WikiText (Merity et al., 2016). |
| Dataset Splits | No | The paper uses pre-trained OPT models and synthetic datasets for inference throughput evaluation. It does not describe training, validation, or test splits, since its experiments evaluate generative inference rather than model training. |
| Hardware Specification | Yes | We run experiments on the NVIDIA T4 GPU instances from Google Cloud. The hardware specifications are listed in Table 1: GPU: NVIDIA T4, 16 GB; CPU: Intel Xeon @ 2.00 GHz, 208 GB DRAM; Disk: cloud default SSD (NVMe), 1.5 TB. |
| Software Dependencies | No | The paper states 'FlexGen is implemented on top of PyTorch (Paszke et al., 2019)' but does not provide a specific version number for PyTorch or any other software dependency crucial for replication. |
| Experiment Setup | Yes | On a single T4 GPU with 208 GB CPU DRAM and 1.5 TB SSD, input sequence length 512, and output sequence length 32. (A hedged reproduction command is sketched below the table.) |
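
The Pseudocode row quotes Algorithm 1 ("Block Schedule with Overlapping"). Below is a minimal, self-contained sketch of that loop structure, not the authors' code: the helper functions are print stubs standing in for FlexGen's asynchronous load/store primitives, which in the real system run on separate CUDA streams so that I/O for the next layer and the next GPU batch overlaps with the current batch's compute.

```python
# Minimal sketch of Algorithm 1 (Block Schedule with Overlapping).
# The six helpers are stubs; in FlexGen they issue asynchronous copies
# between disk, CPU DRAM, and GPU memory on dedicated streams.
GEN_LEN, NUM_LAYERS, NUM_GPU_BATCHES = 2, 2, 2

def load_weight(i, j, k):  print(f"  load weight   layer={j}")
def load_cache(i, j, k):   print(f"  load cache    batch={k}")
def load_hidden(i, j, k):  print(f"  load hidden   batch={k}")
def compute(i, j, k):      print(f"  compute       layer={j} batch={k}")
def store_hidden(i, j, k): print(f"  store hidden  batch={k}")
def store_cache(i, j, k):  print(f"  store cache   batch={k}")

for i in range(GEN_LEN):                  # token positions
    for j in range(NUM_LAYERS):           # layers
        for k in range(NUM_GPU_BATCHES):  # GPU batches within one block
            load_weight(i, j + 1, k)      # prefetch next layer's weights
            load_cache(i, j, k + 1)       # prefetch next batch's KV cache
            load_hidden(i, j, k + 1)      # prefetch next batch's activations
            compute(i, j, k)              # run layer j on batch k
            store_hidden(i, j, k - 1)     # write back previous batch's output
            store_cache(i, j, k - 1)      # offload previous batch's new cache
```

The block order matters: all GPU batches pass through one layer before the next layer's weights are needed, so a single copy of the layer weights is reused across the whole block instead of being reloaded per batch.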
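The throughput claim in the Research Type row can be sanity-checked against the paper's definition of generation throughput (generated tokens divided by total prefill-plus-decode time). A hedged arithmetic sketch, assuming the reported effective batch size of 144 and the 32-token output length from the experiment setup:

```python
# Back-of-envelope check (our arithmetic, not a figure from the paper):
# generation throughput = generated tokens / total latency.
effective_batch_size = 144   # reported effective batch size
output_len = 32              # output sequence length from the setup row
throughput = 1.0             # reported generation throughput, tokens/s

generated_tokens = effective_batch_size * output_len  # 4608 tokens
latency_s = generated_tokens / throughput             # ~4608 s
print(f"{generated_tokens} tokens -> {latency_s / 3600:.2f} h per batch")
```

So 1 token/s at this batch size corresponds to roughly 1.3 hours of wall-clock time per fully generated batch, consistent with the high-throughput, high-latency regime the paper targets.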
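For the Experiment Setup row, a plausible reproduction command built on the released code. The `flexgen.flex_opt` flag names below are taken from the public repository rather than from the paper, so treat them as assumptions; the offload directory is a hypothetical path.

```python
# Hedged reproduction sketch: launch FlexGen on OPT-30B with the
# paper's sequence lengths. Flag names assume the public repository
# (https://github.com/FMInference/FlexGen); the offload path is made up.
import subprocess

subprocess.run([
    "python3", "-m", "flexgen.flex_opt",
    "--model", "facebook/opt-30b",    # 30B model from the HELM experiment
    "--prompt-len", "512",            # input sequence length 512
    "--gen-len", "32",                # output sequence length 32
    "--offload-dir", "/ssd/flexgen-offload",  # hypothetical SSD directory
], check=True)
```

The GPU/CPU/disk placement percentages that reproduce the reported throughput come from FlexGen's cost-model policy search and are deliberately not pinned here.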