FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU

Authors: Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We present FlexGen, a high-throughput generation engine for running LLMs with limited GPU memory... FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems, reaching a generation throughput of 1 token/s for the first time with an effective batch size of 144. On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours."
Researcher Affiliation | Collaboration | Stanford University, UC Berkeley, ETH Zurich, Yandex, HSE University, Meta, Carnegie Mellon University.
Pseudocode | Yes | Algorithm 1, "Block Schedule with Overlapping" (a minimal sketch follows this table).
Open Source Code | Yes | "The code is available at https://github.com/FMInference/FlexGen."
Open Datasets | Yes | "We use synthetic datasets where all prompts are padded to the same length. ... We use two tasks to show that our approximation methods exhibit negligible accuracy loss: next-word prediction on LAMBADA (Paperno et al., 2016) and language modeling on WikiText (Merity et al., 2016)."
Dataset Splits | No | The paper runs pre-trained OPT models on synthetic datasets to evaluate inference throughput. It does not describe training, validation, or test splits for these datasets, since its focus is generative inference rather than model training.
Hardware Specification | Yes | "We run experiments on the NVIDIA T4 GPU instances from Google Cloud. The hardware specifications are listed in Table 1." Table 1: GPU, NVIDIA T4, 16 GB; CPU, Intel Xeon @ 2.00 GHz, 208 GB DRAM; Disk, cloud default SSD (NVMe), 1.5 TB.
Software Dependencies | No | The paper states "FlexGen is implemented on top of PyTorch (Paszke et al., 2019)" but does not give a version number for PyTorch or any other software dependency needed for replication.
Experiment Setup | Yes | "On a single T4 GPU with 208 GB CPU DRAM and 1.5 TB SSD", with input sequence length 512 and output sequence length 32 (an example invocation follows this table).
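Since the pseudocode is central to the paper's throughput claims, here is a minimal Python sketch of the triple loop in Algorithm 1 (Block Schedule with Overlapping). The helper names (load_weight, load_cache, store_cache, compute) are hypothetical stand-ins, not FlexGen's actual API, and worker threads stand in for the CUDA streams FlexGen uses to overlap I/O with compute.

    from concurrent.futures import ThreadPoolExecutor

    # Hypothetical placeholders for FlexGen's tensor movement between
    # disk, CPU DRAM, and GPU memory. Boundary handling (first/last
    # layer and batch) is omitted for brevity.
    def load_weight(layer): pass        # prefetch the next layer's weights
    def load_cache(layer, batch): pass  # prefetch the next batch's KV cache
    def store_cache(layer, batch): pass # write back the previous batch's cache
    def compute(layer, batch): pass     # run one layer on one GPU batch

    def block_schedule(gen_len, num_layers, num_gpu_batches):
        """Sketch of Algorithm 1: for each token, layer, and GPU batch,
        kick off I/O for the adjacent (layer, batch) pairs, run the
        current compute, then synchronize before advancing."""
        with ThreadPoolExecutor(max_workers=3) as pool:
            for i in range(gen_len):                  # token (decoding) loop
                for j in range(num_layers):           # layer loop
                    for k in range(num_gpu_batches):  # GPU-batch loop
                        io_tasks = [
                            pool.submit(load_weight, j + 1),     # next layer
                            pool.submit(load_cache, j, k + 1),   # next batch
                            pool.submit(store_cache, j, k - 1),  # previous batch
                        ]
                        compute(j, k)                 # current batch
                        for task in io_tasks:         # synchronize: all I/O
                            task.result()             # done before moving on

The point of the schedule, per the paper, is that while batch k of layer j is computing, the weights of layer j + 1 and the KV cache of batch k + 1 are already in flight and the outputs of batch k - 1 are being written back, so disk and CPU traffic hides behind GPU compute.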
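For the experiment-setup row, the FlexGen repository's README documents a one-line invocation for running OPT-30B on a 16 GB T4. As a hedged example (the flag names and the six --percent values are quoted from the README as we recall it, so verify against the repository), a run matching this setting would look like:

    python3 -m flexgen.flex_opt --model facebook/opt-30b --percent 0 100 100 0 100 0 --prompt-len 512 --gen-len 32

Here the --percent sextuple splits weights, KV cache, and activations between GPU and CPU (in that order), with any remainder offloaded to disk.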