SGLang: Efficient Execution of Structured Language Model Programs

Authors: Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue (Livia) Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that SGLang achieves up to 6.4× higher throughput compared to state-of-the-art inference systems on various large language and multi-modal models, on tasks including agent control, logical reasoning, few-shot learning benchmarks, JSON decoding, retrieval-augmented generation pipelines, and multi-turn chat. We evaluate the performance of SGLang across diverse LLM workloads, and then conduct ablation studies and case studies to demonstrate the effectiveness of specific components.
Researcher Affiliation | Academia | Lianmin Zheng (2), Liangsheng Yin (3), Zhiqiang Xie (1), Chuyue Sun (1), Jeff Huang (4), Cody Hao Yu (5), Shiyi Cao (2), Christos Kozyrakis (1), Ion Stoica (2), Joseph E. Gonzalez (2), Clark Barrett (1), Ying Sheng (1); (1) Stanford University, (2) UC Berkeley, (3) Shanghai Jiao Tong University, (4) Texas A&M University, (5) Independent Researcher
Pseudocode | Yes | Alg. 1 shows the pseudocode of cache-aware scheduling for RadixAttention with continuous batching. (A hedged sketch of the scheduling idea appears after the table.)
Open Source Code | Yes | The code is publicly available at https://github.com/sgl-project/sglang.
Open Datasets | Yes | We test the following: 5-shot MMLU [14] and 20-shot HellaSwag [61] benchmarks. SGLang has been deployed in Chatbot Arena [8] to serve open-weight models. We test the overhead of RadixAttention on a benchmark without any KV cache reuse opportunities. The benchmark measures throughput on the ShareGPT dataset.
Dataset Splits | No | The paper uses well-known benchmarks and models but does not explicitly specify train/validation/test splits with percentages or sample counts for the main experiments. For one compiler-optimization case study it mentions using '5 of these templates as few-shot training examples and the remaining 15 as test cases', but this is not a general dataset split applied across experiments.
Hardware Specification | Yes | We run most experiments on AWS EC2 G5 instances, which are equipped with NVIDIA A10G GPUs (24GB). We run 7B models on a single A10G GPU and larger models on multiple A10G GPUs with tensor parallelism [44]. We run some additional experiments on A100 (80GB) GPUs. (An assumed multi-GPU launch command appears after the table.)
Software Dependencies | No | SGLang is implemented in PyTorch [37] with custom CUDA kernels from FlashInfer [59] and Triton [48].
Experiment Setup | Yes | Models. We test dense Llama-2 models [49], sparse mixture-of-experts Mixtral models [17], multimodal LLaVA image [27] and video [62] models, and OpenAI's GPT-3.5 as an API model. For open-weight models, the number of parameters ranges from 7 billion to 70 billion, and we use float16 precision. Baselines. We compare SGLang against both high-level programming systems with their respective languages and default runtimes, and low-level inference engines with standard OpenAI-like Completion APIs. Unless otherwise stated, we do not turn on optimizations that would change the computation results, so that all systems compute the same results. The baselines include: Guidance [13], a language for controlling LLMs (v0.1.8 with the llama.cpp backend); vLLM [23], a high-throughput inference engine (v0.2.5 with its default API server); and LMQL [4], a query language (v0.7.3 with the Hugging Face Transformers backend). (An example SGLang program is sketched after the table.)
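
Sketch for the Pseudocode row: a minimal, hedged illustration of the cache-aware scheduling idea behind Algorithm 1 (RadixAttention with continuous batching), not the paper's pseudocode verbatim. The Request class and longest_cached_prefix helper are hypothetical names introduced for this sketch; a real runtime matches each prompt against a radix tree of cached KV-cache prefixes.

from dataclasses import dataclass

@dataclass
class Request:
    request_id: int
    token_ids: list  # tokenized prompt

def longest_cached_prefix(cached_prefixes, tokens):
    # Length of the longest cached token prefix shared with `tokens`.
    # A radix tree finds this in one traversal; a linear scan suffices for a sketch.
    best = 0
    for prefix in cached_prefixes:
        n = 0
        for a, b in zip(prefix, tokens):
            if a != b:
                break
            n += 1
        best = max(best, n)
    return best

def next_batch(waiting, cached_prefixes, batch_size):
    # Cache-aware order: requests that reuse the most cached tokens are scheduled
    # first, which raises the KV-cache hit rate under continuous batching.
    waiting.sort(key=lambda r: longest_cached_prefix(cached_prefixes, r.token_ids),
                 reverse=True)
    return waiting[:batch_size]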
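
For the Hardware Specification row, the paper does not list launch commands for its multi-GPU runs; the line below is an assumed example using the launch script from the public sglang repository, where the --tp flag sets the tensor-parallel degree (some releases spell it --tp-size), with a placeholder model path.

python -m sglang.launch_server --model-path meta-llama/Llama-2-70b-hf --tp 8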
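
For the Experiment Setup row, the high-level baselines are compared against programs written in SGLang's Python-embedded frontend language. A representative program, adapted from the project's public quick-start examples, looks roughly like the sketch below; the endpoint address and questions are placeholders, and argument names may vary across sglang versions.

import sglang as sgl

@sgl.function
def multi_turn_question(s, question_1, question_2):
    # An SGLang program interleaves prompt construction with generation primitives.
    s += sgl.system("You are a helpful assistant.")
    s += sgl.user(question_1)
    s += sgl.assistant(sgl.gen("answer_1", max_tokens=256))
    s += sgl.user(question_2)
    s += sgl.assistant(sgl.gen("answer_2", max_tokens=256))

# Point the frontend at a running SGLang server (placeholder address).
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

state = multi_turn_question.run(
    question_1="What is the capital of France?",
    question_2="Name one museum there.",
)
print(state["answer_1"])
print(state["answer_2"])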