Cascade Speculative Drafting for Even Faster LLM Inference
Authors: Ziyi Chen, Xiaocong Yang, Jiacheng Lin, Chenkai Sun, Kevin Chang, Jie Huang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through theoretical analysis and empirical studies, we demonstrate that the CS Drafting algorithm outperforms speculative decoding in terms of latency across various tasks and settings, achieving an additional speedup of up to 81% over speculative decoding. These findings highlight the practical advantages and efficiency enhancements offered by both vertical and horizontal cascades. |
| Researcher Affiliation | Academia | Ziyi Chen, Xiaocong Yang, Jiacheng Lin, Chenkai Sun, Kevin Chen-Chuan Chang, Jie Huang; University of Illinois at Urbana-Champaign; {ziyic2, kcchang, jeffhj}@illinois.edu |
| Pseudocode | Yes | Combining the horizontal and vertical cascades, the algorithm of cascade speculative decoding is presented in Algorithm 1. At its center, the horizontal cascade is realized by the for loop, while the vertical cascade is implemented through recursive calls. (An illustrative sketch of this structure follows the table.) |
| Open Source Code | Yes | Code is publicly available at https://github.com/lfsszd/CS-Drafting. |
| Open Datasets | Yes | Datasets We chose two commonly used datasets for our experiments. For both datasets, we conducted experiments in a zero-shot chain-of-thought setup [13, 23]: GSM8K [7] is a dataset comprising 8,500 high-quality, linguistically diverse, grade-school math word problems. MMLU [10], or Massive Multitask Language Understanding, is a benchmark for testing how well large language models grasp knowledge. |
| Dataset Splits | Yes | Datasets We chose two commonly used datasets for our experiments. For both datasets, we conducted experiments in a zero-shot chain-of-thought setup [13, 23]: GSM8K [7] is a dataset comprising 8,500 high-quality, linguistically diverse, grade-school math word problems. MMLU [10], or Massive Multitask Language Understanding, is a benchmark for testing how well large language models grasp knowledge. |
| Hardware Specification | Yes | All of our experiments involving walltime are performed on a single NVIDIA A40 GPU. |
| Software Dependencies | No | The paper mentions models like FLAN-T5 and Vicuna-7B, but does not provide specific version numbers for software dependencies such as PyTorch, TensorFlow, or CUDA. |
| Experiment Setup | Yes | We include hyperparameter details in Appendix B. [...] The k-matrix for CS Drafting is [[2, 10], [0, 10]]. When adding tree attention, we limit it to only the leaf node with the highest probability of having children; the k-matrix is [[1, 3], [0, 1]] with the number of children for each leaf node being 8, while the other nodes have no children. [...] Table 5: The experimental results on FLAN-T5 with hyperparameter details. Speedup (MS) is the standardized walltime improvement (SWI) with the assumption that the latency of each run of a model is its number of parameters (model size). Speedup (PW) is the SWI with the assumption that the latency of each run of a model is the time cost data reported from previous work [14]. k11, k12, k22, l are the hyperparameters. |
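
The "Pseudocode" row above describes the core control flow of Algorithm 1: a for loop realizes the horizontal cascade and recursion realizes the vertical cascade. The following is a minimal, illustrative sketch of that structure, not the authors' implementation (which is available at https://github.com/lfsszd/CS-Drafting); the toy models, the greedy acceptance rule, and the k-matrix indexing convention are assumptions made purely for demonstration.

```python
from typing import Callable, List

# A toy "model": maps a prefix to its next `n` tokens under greedy decoding.
Model = Callable[[List[int], int], List[int]]


def verify(target: Model, prefix: List[int], draft: List[int]) -> List[int]:
    """Greedy stand-in for speculative verification: keep the longest draft prefix
    that agrees with the target model's output, then append one bonus target token."""
    reference = target(prefix, len(draft) + 1)
    n_accepted = 0
    while n_accepted < len(draft) and draft[n_accepted] == reference[n_accepted]:
        n_accepted += 1
    return prefix + reference[:n_accepted + 1]


def cs_drafting(models: List[Model], k: List[List[int]], prefix: List[int],
                i: int, n_new: int) -> List[int]:
    """Extend `prefix` by at least `n_new` tokens, with models[i] as the reviewer
    and the strictly smaller models[i + 1:] as its drafters."""
    if i == len(models) - 1:
        # Bottom of the vertical cascade (e.g. a cheap statistical drafter):
        # no smaller drafter remains, so it generates its tokens directly.
        return prefix + models[i](prefix, n_new)

    tokens = list(prefix)
    while len(tokens) < len(prefix) + n_new:
        draft = list(tokens)
        # Horizontal cascade: the for loop gives the earlier, more-likely-accepted
        # positions to larger drafters and the later, cheaper ones to smaller drafters.
        for j in range(i + 1, len(models)):
            if k[i][j] > 0:
                # Vertical cascade: each drafter is itself accelerated recursively.
                draft = cs_drafting(models, k, draft, j, k[i][j])
        # The reviewer checks the assembled draft and keeps its accepted prefix.
        tokens = verify(models[i], tokens, draft[len(tokens):])
    return tokens


# Purely synthetic example: the "models" count upward, with the smaller drafters
# making periodic mistakes so that only part of each draft is accepted.
target = lambda prefix, n: [len(prefix) + t for t in range(n)]
medium = lambda prefix, n: [len(prefix) + t if (len(prefix) + t) % 7 else -1 for t in range(n)]
small = lambda prefix, n: [len(prefix) + t if (len(prefix) + t) % 3 else -1 for t in range(n)]

# k[i][j] = number of tokens reviewer i asks drafter j for in one pass (an assumed
# indexing convention; the paper's k-matrix, e.g. [[2, 10], [0, 10]], is analogous).
k = [[0, 4, 8],
     [0, 0, 6],
     [0, 0, 0]]

print(cs_drafting([target, medium, small], k, prefix=[0], i=0, n_new=12))
```

Running the example prints the target model's own greedy continuation, since every accepted draft token is verified against it; the latency benefit in practice comes from the smaller drafters replacing most runs of the largest model.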
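The "Experiment Setup" row defines Speedup (MS) as standardized walltime improvement under the assumption that a model's per-run latency is proportional to its parameter count. Below is a hedged sketch of that accounting; the model names, per-run costs, and call counts are illustrative placeholders, not numbers reported in the paper.

```python
def standardized_walltime_improvement(calls: dict, cost: dict,
                                      target: str, tokens_generated: int) -> float:
    """Ratio of the assumed cost of plain autoregressive decoding with the target
    model (one target run per generated token) to the cascade's total assumed cost."""
    baseline = tokens_generated * cost[target]
    cascade = sum(n_runs * cost[name] for name, n_runs in calls.items())
    return baseline / cascade


# Illustrative per-run costs: FLAN-T5 parameter counts in billions (xxl ~11B,
# base ~0.25B, small ~0.08B), with the statistical drafter treated as free.
cost = {"flan-t5-xxl": 11.0, "flan-t5-base": 0.25, "flan-t5-small": 0.08, "mag": 0.0}

# Hypothetical call counts for one generation; not measurements from the paper.
calls = {"flan-t5-xxl": 40, "flan-t5-base": 120, "flan-t5-small": 0, "mag": 400}

print(standardized_walltime_improvement(calls, cost, "flan-t5-xxl", tokens_generated=256))
```

Speedup (PW) would follow the same ratio with the per-run costs replaced by the timing data reported in previous work [14] rather than parameter counts.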