Speculative Decoding with CTC-based Draft Model for LLM Inference Acceleration
Authors: Zhuofan Wen, Shangtong Gui, Yang Feng
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiment results show that compared to strong baselines, the proposed method can achieve a higher acceptance rate and hence a faster inference speed. |
| Researcher Affiliation | Academia | 1 Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences; 2 State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences; 3 Key Laboratory of AI Safety, Chinese Academy of Sciences; 4 University of Chinese Academy of Sciences, Beijing, China. {wenzhuofan24z,guishangtong21s,fengyang}@ict.ac.cn |
| Pseudocode | No | The paper includes a diagram (Figure 1) illustrating the model structure and strategy, and describes procedures in text, but does not contain explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We provide the code in Supplementary Material. |
| Open Datasets | Yes | We choose the open-source Vicuna large language model [6] with different parameter sizes as base model to conduct experiments. Vicuna models are fine-tuned from the LLaMA model on the ShareGPT dataset, and are noted below as Vicuna-7b, Vicuna-13b and Vicuna-33b according to their parameter sizes. We also conduct training on LLaMA-2-Chat base models, detailed in the Appendix. [...] Trained models are evaluated on the MT-bench and GSM8K datasets to assess the acceleration performance in various scenarios. MT-Bench is a carefully curated benchmark that includes 80 high-quality, multi-turn questions covering 8 primary categories of user prompts such as writing, roleplay and extraction [27]. GSM8K contains 8.5K high-quality, linguistically diverse grade school math problems [7]. |
| Dataset Splits | No | The paper mentions using the ShareGPT dataset for training and MT-bench and GSM8K for evaluation, but does not specify explicit train/validation/test dataset splits with percentages or sample counts. |
| Hardware Specification | Yes | All training tasks were executed on four 24GB NVIDIA GeForce RTX 3090 devices, taking around two days. |
| Software Dependencies | No | The paper mentions using FP16 precision, but does not specify software dependencies like libraries or frameworks with their version numbers. |
| Experiment Setup | Yes | The learning rate is set to 3 × 10⁻⁵. To avoid gradient explosion, we adopt gradient clipping, setting the clipping threshold to 0.5. We set the max length of training data to 2048. |
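
As context for the "Research Type" row above, the sketch below illustrates the standard speculative-sampling acceptance test that the quoted acceptance-rate claim refers to. It is a generic illustration, not the paper's CTC-based drafting strategy, and the function and argument names are hypothetical.

```python
import random

def accept_draft_tokens(draft_tokens, draft_probs, target_probs):
    """Return the prefix of drafted tokens accepted by the target model.

    Standard speculative-sampling rule: accept each draft token with
    probability min(1, p_target / p_draft); the first rejection truncates
    the draft. The fraction of tokens accepted is the "acceptance rate";
    the higher it is, the fewer target-model forward passes are needed
    per generated token, hence faster inference.
    """
    accepted = []
    for tok, q, p in zip(draft_tokens, draft_probs, target_probs):
        if random.random() < min(1.0, p / q):
            accepted.append(tok)
        else:
            break  # rejection: this position is resampled from the target model
    return accepted
```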
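
The "Experiment Setup" row quotes a learning rate of 3 × 10⁻⁵, a gradient-clipping threshold of 0.5, a maximum training length of 2048, and FP16 precision is mentioned under "Software Dependencies". The following is a minimal PyTorch-style sketch of how those settings could be wired together; the optimizer choice (AdamW), the draft model, the data loader, and the Hugging Face-style `.loss` interface are assumptions, not details from the paper.

```python
import torch
from torch.nn.utils import clip_grad_norm_

MAX_LEN = 2048        # max length of training data (from the paper)
LEARNING_RATE = 3e-5  # learning rate (from the paper)
CLIP_THRESHOLD = 0.5  # gradient-clipping threshold (from the paper)

def train_one_epoch(draft_model, data_loader, device="cuda"):
    # AdamW and the `.loss` attribute on the model output are assumptions.
    optimizer = torch.optim.AdamW(draft_model.parameters(), lr=LEARNING_RATE)
    scaler = torch.cuda.amp.GradScaler()  # FP16 mixed precision
    draft_model.train()
    for batch in data_loader:
        input_ids = batch["input_ids"][:, :MAX_LEN].to(device)
        labels = batch["labels"][:, :MAX_LEN].to(device)
        optimizer.zero_grad()
        with torch.cuda.amp.autocast(dtype=torch.float16):
            loss = draft_model(input_ids=input_ids, labels=labels).loss
        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)  # unscale gradients before clipping
        clip_grad_norm_(draft_model.parameters(), CLIP_THRESHOLD)
        scaler.step(optimizer)
        scaler.update()
```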