InferCept: Efficient Intercept Support for Augmented Large Language Model Inference
Authors: Reyna Abhyankar, Zijian He, Vikranth Srivatsa, Hao Zhang, Yiying Zhang
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We implement INFERCEPT on top of vLLM (Kwon et al., 2023), a state-of-the-art LLM inference system. We evaluate INFERCEPT, vLLM (Discard), Preserve, and Swap on A100 GPUs using three LLMs (GPT-J-6B (Wang & Komatsuzaki, 2021), Vicuna-13B (Zheng et al., 2023a), and Llama3-70B (Meta, 2024)) and the six interception types we study. Overall, INFERCEPT sustains 1.6×–2× higher serving load than vLLM while maintaining similar latency per token generation. INFERCEPT also achieves over 2× more completed requests per second. |
| Researcher Affiliation | Academia | University of California, San Diego, La Jolla, United States. Correspondence to: Reyna Abhyankar <vabhyank@ucsd.edu>, Zijian He <zih015@ucsd.edu>, Yiying Zhang <yiying@ucsd.edu>. |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | INFERCEPT is available at https://github.com/WukLab/InferCept. |
| Open Datasets | Yes | We evaluate this use case with the GSM8K-XL (Hao et al., 2023) dataset, which contains 8.5K high-quality grade-school math problems. We use the Multihop QA Wikipedia (Yang et al., 2018) dataset to evaluate this use case. To evaluate VE, we use the ALFWorld dataset (Shridhar et al., 2021)... We use the ShareGPT dataset (Zheng et al., 2023a)... We use ChatGPT to create a dataset by generating a series of image-generation prompts, each triggering a call to the Stable Diffusion model (Rombach et al., 2021)... we use ChatGPT to generate a series of prompts, each triggering a call to the Bark TTS model (AI, 2023). |
| Dataset Splits | No | We run augmented GPT-J on one NVIDIA A100 GPU. For Vicuna, we use two environments: running on a single A100 GPU and distributed on two A100 GPUs with tensor parallelism. For Llama3, we distribute it on four A100 GPUs with tensor parallelism. To mimic real-world serving scenarios that often receive different types of requests, we use a request dataset that merges the six augmentations presented in Section 2 by uniformly sampling requests from them. |
| Hardware Specification | Yes | We evaluate INFERCEPT, vLLM (Discard), Preserve, and Swap on A100 GPUs using three LLMs (GPT-J-6B (Wang & Komatsuzaki, 2021), Vicuna-13B (Zheng et al., 2023a), and Llama3-70B (Meta, 2024)). We run augmented GPT-J on one NVIDIA A100 GPU. For Vicuna, we use two environments: running on a single A100 GPU and distributed on two A100 GPUs with tensor parallelism. For Llama3, we distribute it on four A100 GPUs with tensor parallelism. (A hedged configuration sketch for this multi-GPU setup appears below the table.) |
| Software Dependencies | No | We implement INFERCEPT on top of vLLM (Kwon et al., 2023) to leverage its PagedAttention technique for regular LLM memory management. Most of our techniques are highly modular and orthogonal to optimizations designed for non-intercepted LLMs. Thus, INFERCEPT can be potentially integrated into other LLM serving systems like DeepSpeed (Aminabadi et al., 2022), Orca (Yu et al., 2022), and TensorRT-LLM (Vaidya et al., 2023). |
| Experiment Setup | No | To mimic real-world serving scenarios that often receive different types of requests, we use a request dataset that merges the six augmentations presented in Section 2 by uniformly sampling requests from them. We run augmented GPT-J on one NVIDIA A100 GPU. For Vicuna, we use two environments: running on a single A100 GPU and distributed on two A100 GPUs with tensor parallelism. For Llama3, we distribute it on four A100 GPUs with tensor parallelism. Following recent LLM inference research papers (Kwon et al., 2023; Yu et al., 2022), we first report the serving throughput as normalized latency (i.e., the median of every request's end-to-end latency divided by its output length) when varying request load (number of requests arrived per second). (A sketch of this metric appears below the table.) |
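
The Hardware Specification and Software Dependencies rows describe serving the three models with vLLM and tensor parallelism across one to four A100 GPUs. The following is a minimal sketch of how such a setup could be launched with stock vLLM; the Hugging Face model ID, vLLM version, and sampling parameters are assumptions for illustration, not details taken from the paper's artifact.

```python
# Hedged sketch: serving Llama3-70B across four A100 GPUs with vLLM's
# tensor parallelism, mirroring the 4-GPU environment described above.
# The model ID and sampling settings below are assumptions, not from the paper.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",  # assumed Hugging Face model ID
    tensor_parallel_size=4,               # shard the model over four GPUs
)

sampling = SamplingParams(temperature=0.0, max_tokens=256)

# Offline batch generation; the paper's INFERCEPT extends this serving stack
# with intercept-aware memory management rather than replacing it.
outputs = llm.generate(["What is 17 * 24?"], sampling)
for out in outputs:
    print(out.outputs[0].text)
```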
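The Experiment Setup row reports serving throughput as normalized latency: the median over requests of end-to-end latency divided by output length. Below is a minimal sketch of that metric under stated assumptions; the `Request` record and `normalized_latency` helper are hypothetical names introduced here for illustration.

```python
# Sketch of the "normalized latency" metric: median over requests of
# (end-to-end latency / number of generated tokens), in seconds per token.
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    arrival_time: float      # seconds since start of trace
    completion_time: float   # seconds since start of trace
    output_tokens: int       # number of generated tokens


def normalized_latency(requests: list[Request]) -> float:
    """Median per-request end-to-end latency divided by output length."""
    per_token = [
        (r.completion_time - r.arrival_time) / r.output_tokens
        for r in requests
        if r.output_tokens > 0
    ]
    return median(per_token)


if __name__ == "__main__":
    # Three illustrative requests: 0.040, 0.040, and 0.060 s/token.
    reqs = [
        Request(arrival_time=0.0, completion_time=4.0, output_tokens=100),
        Request(arrival_time=1.0, completion_time=9.0, output_tokens=200),
        Request(arrival_time=2.0, completion_time=5.0, output_tokens=50),
    ]
    print(f"normalized latency: {normalized_latency(reqs):.3f} s/token")
```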