Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Semi-Clairvoyant Scheduling of Speculative Decoding Requests to Minimize LLM Inference Latency

Authors: Ruixiao Li, Fahao Chen, Peng Li

IJCAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments show that LAPS-SD reduces inference latency by approximately 39% compared to state-of-the-art scheduling methods. We evaluate LAPS-SD using three commonly used datasets: Chatbot Instruction Prompts [Alessandro Palla, 2023], MBPP [Austin et al., 2021], and Mini Thinky [Xuan Son NGUYEN, 2024].
Researcher Affiliation Academia 1School of Cyber Science and Engineering, Xi'an Jiaotong University; 2School of Computer Science and Engineering, The University of Aizu
Pseudocode Yes Algorithm 1: The Proposed LAPS-SD Scheduling Algorithm
Input: Arriving speculative decoding requests
1: Initialize priority queues;
2: Initialize request states as non-perceptible;
3: Put all requests in the queue with the highest priority;
4: Schedule requests with InterQueueSchedule();
5: procedure INTERQUEUESCHEDULE()
6:   for the non-empty queue with the highest priority do
7:     Schedule requests with IntraQueueSchedule()
8:   end for
9: end procedure
10: procedure INTRAQUEUESCHEDULE()
11:   if a request becomes stable then
12:     Change the request's state to perceptible;
13:     Predict the acceptance rate and request length;
14:     Estimate the execution time;
15:   end if
16:   Schedule requests with the semi-clairvoyant strategy;
17: end procedure
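The two-level scheduling loop above can be sketched in Python. This is a minimal illustration, not the paper's implementation: the stability threshold `STABLE_STEPS`, the `Request` fields, and the `estimate_exec_time` predictor are all hypothetical stand-ins for the paper's acceptance-rate and length prediction models.

```python
from dataclasses import dataclass

STABLE_STEPS = 4  # hypothetical threshold after which a request is "stable"

@dataclass
class Request:
    rid: int
    steps_done: int = 0              # decoding steps executed so far
    est_time: float = float("inf")   # unknown until the request is perceptible
    perceptible: bool = False

def estimate_exec_time(req: Request) -> float:
    # Placeholder for the paper's acceptance-rate / request-length predictor.
    return 1.0 + 0.1 * req.steps_done

def inter_queue_schedule(queues):
    """Inter-queue step: serve the highest-priority non-empty queue first."""
    for q in queues:  # queues[0] has the highest priority
        if q:
            return intra_queue_schedule(q)
    return None

def intra_queue_schedule(q):
    """Intra-queue step (semi-clairvoyant): requests whose behavior has
    stabilized become perceptible and are ordered by estimated execution
    time; non-perceptible requests are served in arrival order."""
    for req in q:
        if not req.perceptible and req.steps_done >= STABLE_STEPS:
            req.perceptible = True
            req.est_time = estimate_exec_time(req)
    perceptible = [r for r in q if r.perceptible]
    nxt = min(perceptible, key=lambda r: r.est_time) if perceptible else q[0]
    q.remove(nxt)
    return nxt
```

For example, with one stable request (`steps_done=5`) and one fresh request in the top queue, the stable request is promoted to perceptible and scheduled first by its estimated execution time.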
Open Source Code No The paper does not contain any explicit statements or links indicating the release of open-source code for the described methodology.
Open Datasets Yes We evaluate LAPS-SD using three commonly used datasets: Chatbot Instruction Prompts [Alessandro Palla, 2023], MBPP [Austin et al., 2021], and Mini Thinky [Xuan Son NGUYEN, 2024], following the setup in [Miao et al., 2024]. [...] [Alessandro Palla, 2023] Alessandro Palla. Chatbot instruction prompts. https://huggingface.co/datasets/alespalla/chatbot_instruction_prompts, 2023. [...] [Xuan Son NGUYEN, 2024] Xuan Son NGUYEN. MiniThinky dataset. https://huggingface.co/datasets/ngxson/MiniThinky-dataset, 2024.
Dataset Splits No The paper mentions using three datasets: Chatbot Instruction Prompts, MBPP, and Mini Thinky, but it does not provide specific details on how these datasets were split into training, validation, or test sets.
Hardware Specification Yes Environment. We evaluate LAPS-SD on an NVIDIA L20 GPU with 48GB memory.
Software Dependencies Yes The system runs Ubuntu 20.04.6 with Linux kernel version 5.15.0-91-generic, NVIDIA driver 550.120, CUDA 12.4, and cuDNN 8.6.0. The algorithm is implemented in PyTorch version 2.5.1.
Experiment Setup Yes We assume that the batch size of this serving system is set to 1 for clarity of presentation.