Online Speculative Decoding
Authors: Xiaoxuan Liu, Lanxiang Hu, Peter Bailis, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We develop a prototype of online speculative decoding based on knowledge distillation and evaluate it using both synthetic and real query data. The results show a substantial increase in the token acceptance rate by 0.1 to 0.65, bringing 1.42× to 2.17× latency reduction. |
| Researcher Affiliation | Collaboration | 1UC Berkeley 2UCSD 3Google Inc. 4SJTU. Correspondence to: Hao Zhang <haozhang@ucsd.edu>, Zhijie Deng <zhijied@sjtu.edu.cn>. |
| Pseudocode | Yes | Algorithm 1 Online Speculative Decoding. |
| Open Source Code | Yes | Our code is available at https://github.com/LiuXiaoxuanPKU/OSD. |
| Open Datasets | Yes | We evaluate performance across four diverse datasets: Text-to-SQL (Spider) (Yu et al., 2018), graduate school math (Gsm8k) (Cobbe et al., 2021), Python code generation (Code-search-Python) (Husain et al., 2019), and financial question answering (Alpaca-finance) (Bharti, 2023). |
| Dataset Splits | No | The paper mentions training and test sets but does not explicitly describe a separate validation set or its split ratios for hyperparameter tuning or model selection. |
| Hardware Specification | Yes | We conduct the experiments with llama.cpp (Gerganov, 2023) on a single A100-80G. |
| Software Dependencies | No | The paper mentions "llama.cpp (Gerganov, 2023)" and the "Huggingface Transformer library (hft, 2023)" but does not provide specific version numbers for these software dependencies or for other programming languages and libraries used. |
| Experiment Setup | Yes | In all experiments, we set the number of proposed tokens to 5 for speculative decoding. For all online experiments, we fix the update interval I at 8. |
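
The pseudocode and experiment-setup rows above describe the paper's online speculative decoding loop (Algorithm 1) only at a high level: the draft model proposes 5 tokens per step, and the draft model is refreshed via knowledge distillation every I = 8 intervals. Below is a minimal structural sketch of such a loop, written under stated assumptions: `draft_propose`, `target_verify`, and `distill_update` are hypothetical placeholders standing in for the draft-model proposal, target-model verification, and distillation-update steps, not the authors' actual APIs.

```python
# Minimal sketch of an online speculative decoding serving loop.
# The callables below are hypothetical stand-ins, not the authors' code.
from collections import deque
from typing import Callable, List, Tuple


def online_speculative_decoding(
    queries: List[str],
    draft_propose: Callable[[str, int], List[int]],                  # draft model proposes k tokens
    target_verify: Callable[[str, List[int]], Tuple[List[int], List[int]]],  # returns (accepted, corrected)
    distill_update: Callable[[List[Tuple[str, List[int]]]], None],   # knowledge-distillation update of the draft model
    k: int = 5,                # number of proposed tokens per step (paper uses 5)
    update_interval: int = 8,  # update interval I (paper fixes I = 8)
) -> List[List[int]]:
    """Serve queries with speculative decoding while periodically
    distilling target-model corrections into the draft model."""
    replay_buffer: deque = deque()
    outputs: List[List[int]] = []
    for step, query in enumerate(queries, start=1):
        # Draft model proposes k tokens; target model verifies them.
        draft_tokens = draft_propose(query, k)
        accepted, corrected = target_verify(query, draft_tokens)
        outputs.append(accepted)
        # Keep the target model's corrections as distillation data.
        replay_buffer.append((query, corrected))
        # Every I steps, update the draft model on the buffered queries.
        if step % update_interval == 0 and replay_buffer:
            distill_update(list(replay_buffer))
            replay_buffer.clear()
    return outputs
```

In the paper's setting the update step distills the target model's output distributions on observed queries into the draft model to raise the token acceptance rate; here that step is left abstract so the sketch stays self-contained.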