DistillSpec: Improving Speculative Decoding via Knowledge Distillation

Authors: Yongchao Zhou, Kaifeng Lyu, Ankit Singh Rawat, Aditya Krishna Menon, Afshin Rostamizadeh, Sanjiv Kumar, Jean-François Kagy, Rishabh Agarwal

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Notably, DistillSpec yields 10-45% speedups over standard SD on a range of benchmarks, using both greedy and non-greedy sampling. We show that the distilled model transfers well to various tasks with an average speedup of 26%. Furthermore, we combine DistillSpec with lossy SD to achieve fine-grained control over the latency vs. task performance trade-off. Finally, in practical scenarios with models of varying sizes, first using distillation to boost the performance of the target model and then applying DistillSpec to train a well-aligned draft model can reduce decoding latency by 6-10x with minimal performance drop, compared to standard decoding without distillation.
Researcher Affiliation Collaboration Yongchao Zhou (1,3), Kaifeng Lyu (1,4), Ankit Singh Rawat (1), Aditya Krishna Menon (1), Afshin Rostamizadeh (1), Sanjiv Kumar (1), Jean-François Kagy (1), Rishabh Agarwal (2,5); 1 Google Research, 2 Google DeepMind, 3 University of Toronto, 4 Princeton University, 5 Mila
Pseudocode Yes "Algorithm A.1 Speculative decoding step" and "Algorithm A.2 Knowledge distillation" (found in Appendix A.3).
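For orientation, the sketch below shows one generic speculative decoding step in the style of Leviathan et al. (2023). It is an illustrative reconstruction, not the paper's Algorithm A.1; the function names (draft_probs_fn, target_probs_fn) are placeholders for whatever next-token-distribution interface the draft and target models expose.

```python
import numpy as np

def speculative_decode_step(prefix, draft_probs_fn, target_probs_fn, gamma, rng):
    """One generic speculative decoding step (illustrative sketch only).

    draft_probs_fn(tokens) and target_probs_fn(tokens) each return a
    probability vector over the vocabulary for the next token; gamma is the
    number of tokens the draft model proposes per step.
    """
    # 1) The draft model proposes gamma tokens autoregressively.
    proposed, draft_dists = [], []
    ctx = list(prefix)
    for _ in range(gamma):
        q = draft_probs_fn(ctx)
        x = int(rng.choice(len(q), p=q))
        proposed.append(x)
        draft_dists.append(q)
        ctx.append(x)

    # 2) The target model scores the gamma proposed positions plus one extra
    #    (in practice a single parallel forward pass).
    target_dists = [target_probs_fn(list(prefix) + proposed[:i])
                    for i in range(gamma + 1)]

    # 3) Accept each proposed token x with probability min(1, p(x) / q(x)).
    out = list(prefix)
    for i, x in enumerate(proposed):
        p, q = target_dists[i], draft_dists[i]
        if rng.random() < min(1.0, p[x] / max(q[x], 1e-20)):
            out.append(x)
        else:
            # On rejection, resample from the residual max(p - q, 0), renormalized,
            # and stop at the first rejection.
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            out.append(int(rng.choice(len(residual), p=residual)))
            return out

    # 4) All gamma tokens accepted: sample a bonus token from the target model.
    out.append(int(rng.choice(len(target_dists[-1]), p=target_dists[-1])))
    return out
```

The number of draft tokens accepted per step is exactly what a better-aligned draft model improves, which is the quantity DistillSpec's knowledge distillation targets.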
Open Source Code No The paper discusses the use of 'open-source LLMs' but provides no statement or link indicating that the code for the DistillSpec methodology itself is open-source or publicly available.
Open Datasets Yes XSum (Narayan et al., 2018). The Extreme Summarization (XSum) dataset serves as an evaluation benchmark for abstractive single-document summarization systems. ... CNN/DM (Hermann et al., 2015). ... WMT En-De (Bojar et al., 2014). ... GSM8K (Cobbe et al., 2021). ... LM1B (Chelba et al., 2013). The One Billion Word dataset (LM1B) is a widely recognized benchmark for language modeling.
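The paper does not say how these corpora were obtained. As a non-authoritative convenience, the snippet below shows one plausible way to fetch them with the Hugging Face datasets library; the hub names and configs are assumptions, not something the authors state.

```python
# Assumed Hugging Face hub identifiers; the paper itself does not specify how
# the datasets were loaded or preprocessed.
from datasets import load_dataset

xsum      = load_dataset("xsum")                      # Narayan et al., 2018
cnn_dm    = load_dataset("cnn_dailymail", "3.0.0")    # Hermann et al., 2015
wmt_en_de = load_dataset("wmt14", "de-en")            # Bojar et al., 2014
gsm8k     = load_dataset("gsm8k", "main")             # Cobbe et al., 2021
lm1b      = load_dataset("lm1b")                      # Chelba et al., 2013
```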
Dataset Splits No The paper mentions using 'validation dataset split' and 'validation sets' for evaluation, but it does not provide specific details on the percentages, sample counts, or the methodology for creating the training, validation, and test dataset splits.
Hardware Specification Yes To measure the actual latency, we follow Leviathan et al. (2023) and execute both our target and draft models on the same TPUv4 device without utilizing model parallelism.
Software Dependencies No The paper mentions using an 'Adafactor optimizer' and 'T5 tokenizer' but does not provide specific version numbers for these or any other key software components required for replication.
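Since no framework or versions are given, the snippet below is only one way the named components could be instantiated, using the Hugging Face transformers package; the authors' actual stack (plausibly Google's T5X/JAX tooling, given the TPUv4 hardware) is not specified.

```python
# Assumption: Hugging Face `transformers` stands in for the unspecified stack.
from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers.optimization import Adafactor

tokenizer = T5Tokenizer.from_pretrained("t5-small")             # "T5 tokenizer"
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Adafactor with a fixed learning rate matching Table B.1; the warmup and
# cosine cooldown would come from a separate LR schedule.
optimizer = Adafactor(
    model.parameters(),
    lr=3e-4,
    scale_parameter=False,
    relative_step=False,
    warmup_init=False,
)
```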
Experiment Setup Yes A summary of the hyperparameters used in our knowledge distillation process can be found in Table B.1. Table B.1: Hyperparameters for distillation experiments: training steps: 300,000; batch size: 32; dropout: 0.0; learning rate (LR): 0.0003; LR warmup steps: 5,000; LR cooldown (begin, end): (150,000, 300,000); warmup schedule: linear (from 0 to LR); cooldown schedule: cosine decay (from LR to 0.1 LR).
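Transcribed into a minimal config sketch, Table B.1 reads as follows; the values come from the paper, but the dict and key names are illustrative, not taken from the authors' code.

```python
# Table B.1 distillation hyperparameters, restated for readability.
DISTILLATION_HPARAMS = {
    "training_steps": 300_000,
    "batch_size": 32,
    "dropout": 0.0,
    "learning_rate": 3e-4,            # constant LR between warmup and cooldown
    "lr_warmup_steps": 5_000,         # linear warmup from 0 to LR
    "lr_cooldown_begin": 150_000,     # cosine decay from LR ...
    "lr_cooldown_end": 300_000,       # ... down to 0.1 * LR at the final step
}
```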