Online Speculative Decoding

Authors: Xiaoxuan Liu, Lanxiang Hu, Peter Bailis, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We develop a prototype of online speculative decoding based on knowledge distillation and evaluate it using both synthetic and real query data. The results show a substantial increase in the token acceptance rate by 0.1 to 0.65, bringing a 1.42× to 2.17× latency reduction.
Researcher Affiliation | Collaboration | ¹UC Berkeley, ²UCSD, ³Google Inc., ⁴SJTU. Correspondence to: Hao Zhang <haozhang@ucsd.edu>, Zhijie Deng <zhijied@sjtu.edu.cn>.
Pseudocode | Yes | Algorithm 1: Online Speculative Decoding.
Open Source Code | Yes | Our code is available at https://github.com/LiuXiaoxuanPKU/OSD.
Open Datasets | Yes | We evaluate performance across four diverse datasets: Text-to-SQL (Spider) (Yu et al., 2018), graduate school math (Gsm8k) (Cobbe et al., 2021), Python code generation (Code-search-Python) (Husain et al., 2019), and financial question answering (Alpaca-finance) (Bharti, 2023).
Dataset Splits | No | The paper mentions training and test sets but does not explicitly describe a separate validation set or its split ratios for hyperparameter tuning or model selection.
Hardware Specification | Yes | We conduct the experiments with llama.cpp (Gerganov, 2023) on a single A100-80G.
Software Dependencies | No | The paper mentions llama.cpp (Gerganov, 2023) and the Huggingface Transformer library (hft, 2023) but does not give version numbers for these dependencies or for the other languages and libraries used.
Experiment Setup | Yes | In all experiments, we set the number of proposed tokens to 5 for speculative decoding. For all online experiments, we fix the update interval I at 8.
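
To make the quoted setup concrete, below is a minimal Python sketch of an online speculative decoding serving loop in the spirit of Algorithm 1, using the reported hyperparameters (k = 5 proposed tokens per step, update interval I = 8). The interfaces draft_propose, target_verify, and distill_update are illustrative placeholders, and the loop runs a single speculative step per query for brevity; this is a sketch under those assumptions, not the authors' implementation from the linked repository.

```python
from typing import Callable, List, Tuple

def speculative_step(
    context: List[int],
    draft_propose: Callable[[List[int], int], List[int]],
    target_verify: Callable[[List[int], List[int]], Tuple[int, int]],
    k: int = 5,
) -> Tuple[List[int], int]:
    """Propose k draft tokens, verify them with the target model, and extend the context."""
    proposal = draft_propose(context, k)                          # draft model proposes k tokens
    num_accepted, correction = target_verify(context, proposal)   # target accepts a prefix and emits one corrected token
    return context + proposal[:num_accepted] + [correction], num_accepted

def serve_online(
    queries: List[List[int]],
    draft_propose: Callable[[List[int], int], List[int]],
    target_verify: Callable[[List[int], List[int]], Tuple[int, int]],
    distill_update: Callable[[List[List[int]]], None],
    k: int = 5,
    update_interval: int = 8,
) -> float:
    """Serve queries with speculative decoding while periodically distilling the draft model.

    Returns the observed token acceptance rate (accepted / proposed).
    """
    replay_buffer: List[List[int]] = []
    accepted = proposed = 0
    for i, query in enumerate(queries, start=1):
        context, num_accepted = speculative_step(query, draft_propose, target_verify, k)
        accepted += num_accepted
        proposed += k
        replay_buffer.append(context)      # target-verified tokens double as distillation data
        if i % update_interval == 0:       # refresh the draft model every I = 8 records
            distill_update(replay_buffer)
            replay_buffer.clear()
    return accepted / max(proposed, 1)
```

The design point this illustrates is that outputs the target model already produces during verification can be reused as distillation data, so the draft model can adapt to the live query distribution, which is what raises the token acceptance rate over time.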