Efficient LLM Scheduling by Learning to Rank

Authors: Yichao Fu, Siqi Zhu, Runlong Su, Aurick Qiao, Ion Stoica, Hao Zhang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we reexamine this assumption; we show that, although predicting the exact generation length of each request is infeasible, it is possible to predict the relative ranks of output lengths in a batch of requests, using learning to rank. The ranking information offers valuable guidance for scheduling requests. Building on this insight, we develop a novel scheduler for LLM inference and serving that can approximate the shortest-job-first (SJF) schedule better than existing approaches. We integrate this scheduler with the state-of-the-art LLM serving system and show significant performance improvement in several important applications: 2.8x lower latency in chatbot serving and 6.5x higher throughput in synthetic data generation. (A scheduling sketch based on this idea appears after the table.)
Researcher Affiliation | Collaboration | Yichao Fu (UCSD), Siqi Zhu (Tsinghua University), Runlong Su (UCSD), Aurick Qiao (Snowflake), Ion Stoica (UC Berkeley), Hao Zhang (UCSD)
Pseudocode | Yes | Algorithm 1: Ranking Scheduler
Open Source Code | Yes | Our code is available at https://github.com/hao-ai-lab/vllm-ltr.git.
Open Datasets | Yes | We utilize the latest Meta Llama-3 models in two sizes: 8B and 70B [40]. All experiments use FP16/BF16 precision, which is the most common setting in LLM deployment. The 8B model runs on a single GPU, while the 70B model runs on 8 GPUs with tensor parallelism [41]. Workloads: We evaluate using the ShareGPT [42] and LMSYS-Chat-1M [43] datasets, which comprise open-ended, real-world conversations with proprietary LLM chatbots such as ChatGPT [1] and Claude, as well as 25 other open-source LLMs.
Dataset Splits | No | The paper mentions 10k non-overlapping prompts for serving (testing) and 10k for training the ranking predictor, and an evaluation on a 'randomly sampled test set'. However, it does not specify a distinct validation set split (e.g., percentages or counts) for the overall model evaluation or hyperparameter tuning.
Hardware Specification | Yes | Testbed: Our end-to-end evaluation testbed consists of a DGX server with 8 NVIDIA A100 40GB GPUs, 256 vCPUs, and 1TB host memory. The GPUs are interconnected via NVLink. ... These tests were conducted on a single A100 80GB GPU. ... These experiments were conducted using a Llama-3-8B model on a single 80GB A100 GPU.
Software Dependencies | Yes | We compare our method (i.e., ranking predictor) with four baselines implemented on top of vLLM v0.4.1.
Experiment Setup | Yes | We train the OPT predictor on 10k samples with a batch size of 32 for 5 epochs. We employ the ListMLE loss and the Adam optimizer with a constant learning rate of 2e-5, β1 = 0.9, and β2 = 0.999. To accommodate OPT's context length limitations, we truncate prompts to a maximum of 2,048 tokens. (A training sketch reflecting this setup follows below.)
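
The ranking scheduler (Algorithm 1) admits requests in the order of their predicted output-length ranks, approximating shortest-job-first without ever estimating exact lengths. Below is a minimal sketch of that idea, assuming a score_fn that maps prompts to relative length scores (higher score = longer predicted generation); the names Request, rank_schedule, and the toy scorer are illustrative and not taken from the released vllm-ltr code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Request:
    """A pending generation request and its prompt text."""
    request_id: int
    prompt: str


def rank_schedule(waiting: List[Request],
                  score_fn: Callable[[List[str]], List[float]],
                  max_batch: int) -> List[Request]:
    """Approximate shortest-job-first: score every waiting prompt with the
    learned ranker (higher score = longer predicted generation) and admit
    the lowest-scored requests first, up to the batch limit."""
    scores = score_fn([r.prompt for r in waiting])
    order = sorted(range(len(waiting)), key=lambda i: scores[i])
    return [waiting[i] for i in order[:max_batch]]


if __name__ == "__main__":
    # Toy stand-in for the learned ranker: pretend prompt length tracks
    # generation length (the real predictor is a fine-tuned OPT ranker).
    toy_score = lambda prompts: [float(len(p)) for p in prompts]
    queue = [Request(0, "Write a long essay about distributed systems."),
             Request(1, "Hi!"),
             Request(2, "Summarize this sentence.")]
    for r in rank_schedule(queue, toy_score, max_batch=2):
        print(r.request_id, r.prompt)
```

Because only the relative order of the scores matters, any monotone rescaling of the predictor's outputs yields the same schedule.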
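
For the predictor training described under Experiment Setup, here is a minimal sketch of ListMLE fine-tuning under the stated hyperparameters (batch size 32, 5 epochs, Adam with learning rate 2e-5 and betas (0.9, 0.999), prompts truncated to 2,048 tokens). The facebook/opt-125m checkpoint, the scalar-head setup via AutoModelForSequenceClassification, the (prompt, output_length) data format, and the helper names listmle_loss/train_ranker are assumptions for illustration, not the paper's released implementation.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer


def listmle_loss(scores: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
    """ListMLE: negative Plackett-Luce log-likelihood of the ground-truth
    ordering. The list is ordered longest-generation-first, so the model
    learns to give longer requests higher scores (only ranks matter)."""
    order = torch.argsort(lengths, descending=True)
    s = scores[order]
    # log-sum-exp over the remaining (suffix) items at each rank position
    suffix_lse = torch.flip(torch.logcumsumexp(torch.flip(s, [0]), dim=0), [0])
    return (suffix_lse - s).sum()


def train_ranker(pairs, model_name="facebook/opt-125m", epochs=5,
                 batch_size=32, lr=2e-5, max_len=2048, device="cuda"):
    """Fine-tune an OPT checkpoint with a scalar head to rank prompts by
    their eventual generation length. `pairs` is a list of
    (prompt, observed_output_length) tuples."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=1).to(device)
    optim = torch.optim.Adam(model.parameters(), lr=lr, betas=(0.9, 0.999))
    loader = DataLoader(pairs, batch_size=batch_size, shuffle=True,
                        collate_fn=lambda batch: list(zip(*batch)))
    model.train()
    for _ in range(epochs):
        for prompts, lengths in loader:
            enc = tok(list(prompts), truncation=True, max_length=max_len,
                      padding=True, return_tensors="pt").to(device)
            scores = model(**enc).logits.squeeze(-1)
            target = torch.tensor(lengths, dtype=torch.float, device=device)
            loss = listmle_loss(scores, target)   # each minibatch is one ranked list
            optim.zero_grad()
            loss.backward()
            optim.step()
    return model
```

Because ListMLE penalizes only the ordering of the predicted scores, not their magnitudes, the predictor never has to estimate exact generation lengths, which is the paper's central argument for why this easier ranking problem suffices for scheduling.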