Cascade Speculative Drafting for Even Faster LLM Inference
Authors: Ziyi Chen, Xiaocong Yang, Jiacheng Lin, Chenkai Sun, Kevin Chang, Jie Huang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through theoretical analysis and empirical studies, we demonstrate that the CS Drafting algorithm outperforms speculative decoding in terms of latency across various tasks and settings, achieving an additional speedup of up to 81% over speculative decoding. These findings highlight the practical advantages and efficiency enhancements offered by both vertical and horizontal cascades. |
| Researcher Affiliation | Academia | Ziyi Chen, Xiaocong Yang, Jiacheng Lin, Chenkai Sun, Kevin Chen-Chuan Chang, Jie Huang; University of Illinois at Urbana-Champaign; {ziyic2, kcchang, jeffhj}@illinois.edu |
| Pseudocode | Yes | Combining the horizontal and vertical cascades, the algorithm of cascade speculative decoding is presented in Algorithm 1. At its center, the horizontal cascade is realized by the for loop, while the vertical cascade is implemented through recursive calls. (An illustrative sketch of this structure follows the table.) |
| Open Source Code | Yes | Code is publicly available at https://github.com/lfsszd/CS-Drafting. |
| Open Datasets | Yes | Datasets We chose two commonly used datasets for our experiments. For both datasets, we conducted experiments in a zero-shot chain-of-thought setup [13, 23]: GSM8K [7] is a dataset comprising 8,500 high-quality, linguistically diverse, grade-school math word problems. MMLU [10], or Massive Multitask Language Understanding, is a benchmark for testing how well large language models grasp knowledge. |
| Dataset Splits | Yes | Datasets We chose two commonly used datasets for our experiments. For both datasets, we conducted experiments in a zero-shot chain-of-thought setup [13, 23]: GSM8K [7] is a dataset comprising 8,500 high-quality, linguistically diverse, grade-school math word problems. MMLU [10], or Massive Multitask Language Understanding, is a benchmark for testing how well large language models grasp knowledge. |
| Hardware Specification | Yes | All of our experiments involving walltime are performed on a single NVIDIA A40 GPU. |
| Software Dependencies | No | The paper mentions models like FLAN-T5 and Vicuna-7B, but does not provide specific version numbers for software dependencies such as PyTorch, TensorFlow, or CUDA. |
| Experiment Setup | Yes | We include hyperparameter details in Appendix B. [...] The k-matrix for CS Drafting is [[2, 10], [0, 10]]. When adding tree attention, we limit it to only the leaf node with the highest probability of having children; the k-matrix is [[1, 3], [0, 1]] with the number of children for each leaf node being 8, while the other nodes have no children. [...] Table 5: The experimental results on FLAN-T5 with hyperparameter details. Speedup (MS) is the standardized walltime improvement (SWI) with the assumption that the latency of each run of a model is its number of parameters (model size). Speedup (PW) is the SWI with the assumption that the latency of each run of a model is the time cost data reported from previous work [14]. k11, k12, k22, l are the hyperparameters. |
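
The "Pseudocode" row above describes the core control flow of Algorithm 1: a for loop realizes the horizontal cascade and recursion realizes the vertical cascade. The following is a minimal, illustrative sketch of that structure, not the authors' implementation (which is available at https://github.com/lfsszd/CS-Drafting); the toy models, the greedy acceptance rule, and the k-matrix indexing convention are assumptions made purely for demonstration.

```python
from typing import Callable, List

# A toy "model": maps a prefix to its next `n` tokens under greedy decoding.
Model = Callable[[List[int], int], List[int]]


def verify(target: Model, prefix: List[int], draft: List[int]) -> List[int]:
    """Greedy stand-in for speculative verification: keep the longest draft prefix
    that agrees with the target model's output, then append one bonus target token."""
    reference = target(prefix, len(draft) + 1)
    n_accepted = 0
    while n_accepted < len(draft) and draft[n_accepted] == reference[n_accepted]:
        n_accepted += 1
    return prefix + reference[:n_accepted + 1]


def cs_drafting(models: List[Model], k: List[List[int]], prefix: List[int],
                i: int, n_new: int) -> List[int]:
    """Extend `prefix` by at least `n_new` tokens, with models[i] as the reviewer
    and the strictly smaller models[i + 1:] as its drafters."""
    if i == len(models) - 1:
        # Bottom of the vertical cascade (e.g. a cheap statistical drafter):
        # no smaller drafter remains, so it generates its tokens directly.
        return prefix + models[i](prefix, n_new)

    tokens = list(prefix)
    while len(tokens) < len(prefix) + n_new:
        draft = list(tokens)
        # Horizontal cascade: the for loop gives the earlier, more-likely-accepted
        # positions to larger drafters and the later, cheaper ones to smaller drafters.
        for j in range(i + 1, len(models)):
            if k[i][j] > 0:
                # Vertical cascade: each drafter is itself accelerated recursively.
                draft = cs_drafting(models, k, draft, j, k[i][j])
        # The reviewer checks the assembled draft and keeps its accepted prefix.
        tokens = verify(models[i], tokens, draft[len(tokens):])
    return tokens


# Purely synthetic example: the "models" count upward, with the smaller drafters
# making periodic mistakes so that only part of each draft is accepted.
target = lambda prefix, n: [len(prefix) + t for t in range(n)]
medium = lambda prefix, n: [len(prefix) + t if (len(prefix) + t) % 7 else -1 for t in range(n)]
small = lambda prefix, n: [len(prefix) + t if (len(prefix) + t) % 3 else -1 for t in range(n)]

# k[i][j] = number of tokens reviewer i asks drafter j for in one pass (an assumed
# indexing convention; the paper's k-matrix, e.g. [[2, 10], [0, 10]], is analogous).
k = [[0, 4, 8],
     [0, 0, 6],
     [0, 0, 0]]

print(cs_drafting([target, medium, small], k, prefix=[0], i=0, n_new=12))
```

Running the example prints the target model's own greedy continuation, since every accepted draft token is verified against it; the latency benefit in practice comes from the smaller drafters replacing most runs of the largest model.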
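The "Experiment Setup" row defines Speedup (MS) as standardized walltime improvement under the assumption that a model's per-run latency is proportional to its parameter count. Below is a hedged sketch of that accounting; the model names, per-run costs, and call counts are illustrative placeholders, not numbers reported in the paper.

```python
def standardized_walltime_improvement(calls: dict, cost: dict,
                                      target: str, tokens_generated: int) -> float:
    """Ratio of the assumed cost of plain autoregressive decoding with the target
    model (one target run per generated token) to the cascade's total assumed cost."""
    baseline = tokens_generated * cost[target]
    cascade = sum(n_runs * cost[name] for name, n_runs in calls.items())
    return baseline / cascade


# Illustrative per-run costs: FLAN-T5 parameter counts in billions (xxl ~11B,
# base ~0.25B, small ~0.08B), with the statistical drafter treated as free.
cost = {"flan-t5-xxl": 11.0, "flan-t5-base": 0.25, "flan-t5-small": 0.08, "mag": 0.0}

# Hypothetical call counts for one generation; not measurements from the paper.
calls = {"flan-t5-xxl": 40, "flan-t5-base": 120, "flan-t5-small": 0, "mag": 400}

print(standardized_walltime_improvement(calls, cost, "flan-t5-xxl", tokens_generated=256))
```

Speedup (PW) would follow the same ratio with the per-run costs replaced by the timing data reported in previous work [14] rather than parameter counts.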