InferCept: Efficient Intercept Support for Augmented Large Language Model Inference

Authors: Reyna Abhyankar, Zijian He, Vikranth Srivatsa, Hao Zhang, Yiying Zhang

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We implement INFERCEPT on top of vLLM (Kwon et al., 2023), a state-of-the-art LLM inference system. We evaluate INFERCEPT, vLLM (Discard), Preserve, and Swap on A100 GPUs using three LLMs (GPT-J-6B (Wang & Komatsuzaki, 2021), Vicuna-13B (Zheng et al., 2023a), and Llama3-70B (Meta, 2024)) and the six interception types we study. Overall, INFERCEPT sustains 1.6x-2x higher serving load than vLLM while maintaining similar latency per token generation. INFERCEPT also achieves over 2x more completed requests per second.
Researcher Affiliation | Academia | University of California, San Diego, La Jolla, United States. Correspondence to: Reyna Abhyankar <vabhyank@ucsd.edu>, Zijian He <zih015@ucsd.edu>, Yiying Zhang <yiying@ucsd.edu>.
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | INFERCEPT is available at https://github.com/WukLab/InferCept.
Open Datasets | Yes | We evaluate this use case with the GSM8K-XL (Hao et al., 2023) dataset, which contains 8.5K high-quality grade-school math problems. We use the multihop QA Wikipedia (Yang et al., 2018) dataset to evaluate this use case. To evaluate VE, we use the ALFWorld dataset (Shridhar et al., 2021)... We use the ShareGPT dataset (Zheng et al., 2023a)... We use ChatGPT to create a dataset by generating a series of image-generation prompts, each triggering a call to the Stable Diffusion model (Rombach et al., 2021)... we use ChatGPT to generate a series of prompts, each triggering a call to the Bark TTS model (AI, 2023).
Dataset Splits | No | We run augmented GPT-J on one NVIDIA A100 GPU. For Vicuna, we use two environments: running on a single A100 GPU and distributed on two A100 GPUs with tensor parallelism. For Llama3, we distribute it on four A100 GPUs with tensor parallelism. To mimic real-world serving scenarios that often receive different types of requests, we use a request dataset that merges the six augmentations presented in §2 by uniformly sampling requests from them.
Hardware Specification | Yes | We evaluate INFERCEPT, vLLM (Discard), Preserve, and Swap on A100 GPUs using three LLMs (GPT-J-6B (Wang & Komatsuzaki, 2021), Vicuna-13B (Zheng et al., 2023a), and Llama3-70B (Meta, 2024)). We run augmented GPT-J on one NVIDIA A100 GPU. For Vicuna, we use two environments: running on a single A100 GPU and distributed on two A100 GPUs with tensor parallelism. For Llama3, we distribute it on four A100 GPUs with tensor parallelism.
Software Dependencies | No | We implement INFERCEPT on top of vLLM (Kwon et al., 2023) to leverage its PagedAttention technique for regular LLM memory management. Most of our techniques are highly modular and orthogonal to optimizations designed for non-intercepted LLMs. Thus, INFERCEPT can be potentially integrated into other LLM serving systems like DeepSpeed (Aminabadi et al., 2022), Orca (Yu et al., 2022), and TensorRT-LLM (Vaidya et al., 2023).
Experiment Setup | No | To mimic real-world serving scenarios that often receive different types of requests, we use a request dataset that merges the six augmentations presented in §2 by uniformly sampling requests from them. We run augmented GPT-J on one NVIDIA A100 GPU. For Vicuna, we use two environments: running on a single A100 GPU and distributed on two A100 GPUs with tensor parallelism. For Llama3, we distribute it on four A100 GPUs with tensor parallelism. Following recent LLM inference research papers (Kwon et al., 2023; Yu et al., 2022), we first report the serving throughput as normalized latency (i.e., the median of every request's end-to-end latency divided by its output length) when varying request load (number of requests arrived per second). A minimal sketch of this workload merging and normalized-latency metric follows the table.
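The two technical pieces of the setup quoted above can be illustrated with a short Python sketch: uniformly sampling requests across augmentation types to build a mixed workload, and computing normalized latency as the median per-request end-to-end latency divided by output length. This is a hedged illustration only, not code from the INFERCEPT artifact; the helper names (build_merged_workload, normalized_latency) and all example numbers are assumptions made for clarity.

    import random
    import statistics

    def build_merged_workload(augmentation_traces, num_requests, seed=0):
        """Uniformly sample requests across augmentation types to mimic a mixed
        serving workload (a stand-in for the paper's merged six-augmentation trace)."""
        rng = random.Random(seed)
        merged = []
        for _ in range(num_requests):
            trace = rng.choice(augmentation_traces)  # pick an augmentation type uniformly
            merged.append(rng.choice(trace))         # pick a request from that type uniformly
        return merged

    def normalized_latency(end_to_end_latencies_s, output_lengths_tokens):
        """Median of each request's end-to-end latency divided by its output length
        (seconds per generated token), reported while sweeping request arrival rate."""
        per_token = [
            latency / max(out_len, 1)
            for latency, out_len in zip(end_to_end_latencies_s, output_lengths_tokens)
        ]
        return statistics.median(per_token)

    # Toy usage with made-up numbers (three tiny traces, three completed requests).
    traces = [["math-q1", "math-q2"], ["wiki-qa-q1"], ["tts-q1", "tts-q2"]]
    print("mixed workload:", build_merged_workload(traces, num_requests=5))

    latencies_s = [12.4, 8.1, 20.7]   # end-to-end latency per request, seconds
    output_lens = [256, 128, 512]     # generated tokens per request
    print(f"normalized latency: {normalized_latency(latencies_s, output_lens):.4f} s/token")

Under this metric, a lower value means the system keeps per-token generation latency low even as the request arrival rate increases, which is how the paper compares INFERCEPT against vLLM (Discard), Preserve, and Swap.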