SGLang: Efficient Execution of Structured Language Model Programs
Authors: Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue (Livia) Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that SGLang achieves up to 6.4× higher throughput compared to state-of-the-art inference systems on various large language and multi-modal models, on tasks including agent control, logical reasoning, few-shot learning benchmarks, JSON decoding, retrieval-augmented generation pipelines, and multi-turn chat. We evaluate the performance of SGLang across diverse LLM workloads. Subsequently, we conduct ablation studies and case studies to demonstrate the effectiveness of specific components. |
| Researcher Affiliation | Academia | Lianmin Zheng², Liangsheng Yin³, Zhiqiang Xie¹, Chuyue Sun¹, Jeff Huang⁴, Cody Hao Yu⁵, Shiyi Cao², Christos Kozyrakis¹, Ion Stoica², Joseph E. Gonzalez², Clark Barrett¹, Ying Sheng¹ (¹ Stanford University, ² UC Berkeley, ³ Shanghai Jiao Tong University, ⁴ Texas A&M University, ⁵ Independent Researcher) |
| Pseudocode | Yes | Alg. 1 shows the pseudocode of cache-aware scheduling for RadixAttention with continuous batching (a simplified scheduling sketch appears after this table). |
| Open Source Code | Yes | The code is publicly available at https://github.com/sgl-project/sglang (a minimal frontend usage sketch appears after this table). |
| Open Datasets | Yes | We test the following: 5-shot MMLU [14] and 20-shot HellaSwag [61] benchmarks. SGLang has been deployed in Chatbot Arena [8] to serve open-weight models. We test the overhead of RadixAttention on a benchmark without any KV cache reuse opportunities. The benchmark measures throughput on the ShareGPT dataset. |
| Dataset Splits | No | The paper uses well-known benchmarks and models, but does not explicitly specify train/validation/test dataset splits with percentages or sample counts for the main experiments. For a specific compiler optimization case study, it mentions '5 of these templates as few-shot training examples and the remaining 15 as test cases', but this is not a general dataset split for all experiments. |
| Hardware Specification | Yes | We run most experiments on AWS EC2 G5 instances, which are equipped with NVIDIA A10G GPUs (24GB). We run 7B models on a single A10G GPU and larger models on multiple A10G GPUs with tensor parallelism [44]. We run some additional experiments on A100G (80GB) GPUs. |
| Software Dependencies | No | SGLang is implemented in PyTorch [37] with custom CUDA kernels from FlashInfer [59] and Triton [48]. |
| Experiment Setup | Yes | Models. We test dense Llama-2 models [49], sparse mixture-of-experts Mixtral models [17], multimodal LLaVA image [27] and video [62] models, and OpenAI's API model GPT-3.5. For open-weight models, the number of parameters ranges from 7 billion to 70 billion, and we use float16 precision. Baselines. We compare SGLang against both high-level programming systems with their respective languages and default runtimes, as well as low-level inference engines with standard OpenAI-like Completion APIs. Unless otherwise stated, we do not turn on optimizations that would change the computation results, so that all systems compute the same results. The baselines include: Guidance [13], a language for controlling LLMs (we use Guidance v0.1.8 with the llama.cpp backend); vLLM [23], a high-throughput inference engine (we use vLLM v0.2.5 and its default API server); and LMQL [4], a query language (we use LMQL v0.7.3 with the Hugging Face Transformers backend). A hypothetical throughput-harness sketch for this setup appears after the table. |
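
Below is a minimal, simplified sketch of the cache-aware scheduling idea referenced in the Pseudocode row: rank waiting requests by how much of their prompt is already in the prefix cache and batch the best matches first. This is not the paper's Algorithm 1; the `PrefixCache` class, the token-budget policy, and all names are illustrative assumptions (the real system uses a radix tree over pages of KV cache).

```python
from typing import List, Set, Tuple


class PrefixCache:
    """Toy stand-in for the radix tree: remembers which token prefixes are cached."""

    def __init__(self) -> None:
        self._cached: Set[Tuple[int, ...]] = set()

    def match_len(self, tokens: List[int]) -> int:
        """Length of the longest cached prefix of `tokens`."""
        for n in range(len(tokens), 0, -1):
            if tuple(tokens[:n]) in self._cached:
                return n
        return 0

    def insert(self, tokens: List[int]) -> None:
        # Cache every prefix so later requests can match partial overlaps.
        for n in range(1, len(tokens) + 1):
            self._cached.add(tuple(tokens[:n]))


def schedule_batch(waiting: List[List[int]], cache: PrefixCache,
                   token_budget: int) -> List[List[int]]:
    """Pick the next batch, preferring requests with the longest cached prefix.

    Longest-prefix-first ordering keeps requests that share a prompt prefix
    together, which is the intuition behind cache-aware scheduling.
    """
    ranked = sorted(waiting, key=cache.match_len, reverse=True)
    batch, used = [], 0
    for req in ranked:
        # Only the un-cached suffix consumes new KV-cache memory.
        extend_cost = len(req) - cache.match_len(req)
        if used + extend_cost > token_budget:
            continue
        batch.append(req)
        used += extend_cost
    for req in batch:
        waiting.remove(req)
        cache.insert(req)
    return batch
```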
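For the Open Source Code row, here is a minimal usage sketch of the released frontend language. It assumes a locally launched SGLang server at http://localhost:30000 and follows the API names in the public repository (`sgl.function`, `sgl.gen`, `sgl.RuntimeEndpoint`); exact signatures may differ across versions.

```python
# Minimal SGLang frontend sketch (pip install "sglang[all]"); assumes a local
# SGLang server is already running at http://localhost:30000.
import sglang as sgl


@sgl.function
def multi_turn_qa(s, question1, question2):
    # Each += appends to the prompt; sgl.gen() marks a generation call whose
    # output is stored in the state under the given name.
    s += sgl.user(question1)
    s += sgl.assistant(sgl.gen("answer1", max_tokens=128))
    s += sgl.user(question2)
    s += sgl.assistant(sgl.gen("answer2", max_tokens=128))


if __name__ == "__main__":
    sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
    state = multi_turn_qa.run(
        question1="What is RadixAttention?",
        question2="Why does prefix sharing help throughput?",
    )
    print(state["answer1"])
    print(state["answer2"])
```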
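As a hedged illustration of the throughput comparison in the Experiment Setup row, the sketch below measures generated tokens per second against an OpenAI-like `/v1/completions` endpoint. The endpoint URL, model name, concurrency level, and workload are placeholder assumptions, not the paper's actual benchmark harness.

```python
# Hypothetical throughput harness: query an OpenAI-compatible completion
# endpoint concurrently and report generated tokens per second.
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List

import requests

ENDPOINT = "http://localhost:30000/v1/completions"  # assumed server address
MODEL = "meta-llama/Llama-2-7b-hf"                  # example model name


def complete(prompt: str, max_tokens: int = 256) -> int:
    """Send one completion request and return the number of generated tokens."""
    resp = requests.post(ENDPOINT, json={
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    })
    resp.raise_for_status()
    return resp.json().get("usage", {}).get("completion_tokens", 0)


def measure_throughput(prompts: List[str], concurrency: int = 32) -> float:
    start = time.time()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        generated = sum(pool.map(complete, prompts))
    return generated / (time.time() - start)


if __name__ == "__main__":
    prompts = ["Summarize the SGLang paper in one sentence."] * 64  # placeholder workload
    print(f"{measure_throughput(prompts):.1f} tokens/s")
```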