Speculative Decoding with CTC-based Draft Model for LLM Inference Acceleration
Authors: Zhuofan Wen, Shangtong Gui, Yang Feng
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiment results show that compared to strong baselines, the proposed method can achieve a higher acceptance rate and hence a faster inference speed. |
| Researcher Affiliation | Academia | 1 Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences; 2 State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences; 3 Key Laboratory of AI Safety, Chinese Academy of Sciences; 4 University of Chinese Academy of Sciences, Beijing, China. {wenzhuofan24z,guishangtong21s,fengyang}@ict.ac.cn |
| Pseudocode | No | The paper includes a diagram (Figure 1) illustrating the model structure and strategy, and describes procedures in text, but does not contain explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We provide the code in Supplementary Material. |
| Open Datasets | Yes | We choose the open-source Vicuna large language model [6] with different parameter sizes as base model to conduct experiments. Vicuna models are fine-tuned from the LLaMA model on the ShareGPT dataset, and are noted below as Vicuna-7b, Vicuna-13b and Vicuna-33b according to their parameter sizes. We also conduct training on LLaMA-2-Chat base models, detailed in the Appendix. [...] Trained models are evaluated on the MT-bench and GSM8K datasets to assess the acceleration performance in various scenarios. MT-Bench is a carefully curated benchmark that includes 80 high-quality, multi-turn questions covering 8 primary categories of user prompts such as writing, roleplay and extraction [27]. GSM8K contains 8.5K high-quality, linguistically diverse grade school math problems [7]. |
| Dataset Splits | No | The paper mentions using the ShareGPT dataset for training and MT-bench and GSM8K for evaluation, but does not specify explicit train/validation/test dataset splits with percentages or sample counts. |
| Hardware Specification | Yes | All training tasks were executed on four 24GB NVIDIA GeForce RTX 3090 devices, taking around two days. |
| Software Dependencies | No | The paper mentions using FP16 precision, but does not specify software dependencies like libraries or frameworks with their version numbers. |
| Experiment Setup | Yes | The learning rate is set to 3 × 10⁻⁵. To avoid gradient explosion, we adopt gradient clipping, setting the clipping threshold to 0.5. We set the max length of training data to 2048. |
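
As context for the "Research Type" row above, the sketch below illustrates the standard speculative-sampling acceptance test that the quoted acceptance-rate claim refers to. It is a generic illustration, not the paper's CTC-based drafting strategy, and the function and argument names are hypothetical.

```python
import random

def accept_draft_tokens(draft_tokens, draft_probs, target_probs):
    """Return the prefix of drafted tokens accepted by the target model.

    Standard speculative-sampling rule: accept each draft token with
    probability min(1, p_target / p_draft); the first rejection truncates
    the draft. The fraction of tokens accepted is the "acceptance rate";
    the higher it is, the fewer target-model forward passes are needed
    per generated token, hence faster inference.
    """
    accepted = []
    for tok, q, p in zip(draft_tokens, draft_probs, target_probs):
        if random.random() < min(1.0, p / q):
            accepted.append(tok)
        else:
            break  # rejection: this position is resampled from the target model
    return accepted
```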
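
The "Experiment Setup" row quotes a learning rate of 3 × 10⁻⁵, a gradient-clipping threshold of 0.5, a maximum training length of 2048, and FP16 precision is mentioned under "Software Dependencies". The following is a minimal PyTorch-style sketch of how those settings could be wired together; the optimizer choice (AdamW), the draft model, the data loader, and the Hugging Face-style `.loss` interface are assumptions, not details from the paper.

```python
import torch
from torch.nn.utils import clip_grad_norm_

MAX_LEN = 2048        # max length of training data (from the paper)
LEARNING_RATE = 3e-5  # learning rate (from the paper)
CLIP_THRESHOLD = 0.5  # gradient-clipping threshold (from the paper)

def train_one_epoch(draft_model, data_loader, device="cuda"):
    # AdamW and the `.loss` attribute on the model output are assumptions.
    optimizer = torch.optim.AdamW(draft_model.parameters(), lr=LEARNING_RATE)
    scaler = torch.cuda.amp.GradScaler()  # FP16 mixed precision
    draft_model.train()
    for batch in data_loader:
        input_ids = batch["input_ids"][:, :MAX_LEN].to(device)
        labels = batch["labels"][:, :MAX_LEN].to(device)
        optimizer.zero_grad()
        with torch.cuda.amp.autocast(dtype=torch.float16):
            loss = draft_model(input_ids=input_ids, labels=labels).loss
        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)  # unscale gradients before clipping
        clip_grad_norm_(draft_model.parameters(), CLIP_THRESHOLD)
        scaler.step(optimizer)
        scaler.update()
```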