GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative Decoding

Authors: Cunxiao Du, Jing Jiang, Xu Yuanchen, Jiawei Wu, Sicheng Yu, Yongqi Li, Shenggui Li, Kai Xu, Liqiang Nie, Zhaopeng Tu, Yang You

ICML 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Extensive experiments on different benchmarks demonstrate that our proposed GLIDE draft model significantly reduces the expected decoding latency. Additional evaluation using wall time reveals that GLIDE can accelerate Vicuna models up to 2.17x and further extend the improvement to 2.61x with CAPE. |
| Researcher Affiliation | Collaboration | 1 Singapore Management University, 2 National University of Singapore, 3 The Hong Kong Polytechnic University, 4 HPC-AI Tech, 5 Harbin Institute of Technology (Shenzhen), 6 Tencent AI Lab |
| Pseudocode | Yes | See Algorithm 1 for the proposal expansion and Algorithm 2 for the verification. (A minimal verification sketch appears after the table.) |
| Open Source Code | Yes | We release our code, data, and the trained draft models at https://github.com/NonvolatileMemory/GliDe_with_a_CaPE_ICML_24. |
| Open Datasets | Yes | We first train our draft model on the pre-training dataset SlimPajama-6B (Soboleva et al., 2023). We then finetune the draft model on a supervised-finetuning (SFT) dataset (ShareGPT (GPT3.5 & 4, 2023) in our case) to further improve the model performance. Following Liu et al. (2023), we evaluate our GLIDE method across four different datasets: GSM8K (Cobbe et al., 2021) (math reasoning), Finance-Alpaca (Bharti, 2023) (QA for finance), Spider (Yu et al., 2018) (text-to-SQL), and Code-Search-Python (Husain et al., 2020) (Python code generation). We follow Cai et al. (2023) and use the well-known benchmark dataset MT-Bench (Zheng et al., 2023) for the evaluation of CAPE. (A dataset-loading sketch appears after the table.) |
| Dataset Splits | No | The paper names its training and finetuning datasets but does not specify train/validation/test splits with percentages or sample counts. |
| Hardware Specification | Yes | In the case of the 7B and 13B target models, we train GLIDE with ZeRO-2 and eight H800 GPUs. For the 33B target model, we use ZeRO-3 and 16 H800 GPUs. All the inference processes in this paper are performed using fp16 and on a single H800 GPU. (A representative ZeRO configuration sketch appears after the table.) |
| Software Dependencies | No | The paper mentions fp16 and AdamW (Kingma & Ba, 2015) but does not provide version numbers for software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We set the batch size (with accumulation) to 64 and the learning rate to 5e-4, and use AdamW (Kingma & Ba, 2015) to optimize the draft model. We only train our draft model for one epoch on both the pretraining and SFT datasets. We set the proposal length γ to 5 and adopt speculative sampling as our acceptance strategy, following Liu et al. (2023). (A training-configuration sketch appears after the table.) |
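
The Pseudocode row points to Algorithm 1 (proposal expansion) and Algorithm 2 (verification), and the paper adopts speculative sampling as its acceptance strategy. The sketch below shows the standard speculative-sampling accept/reject rule, not a transcription of the paper's Algorithm 2; the function name and tensor layout are illustrative assumptions.

```python
import torch

def verify_speculative(draft_tokens, draft_probs, target_probs):
    """Standard speculative-sampling verification (sketch, not the paper's exact Algorithm 2).

    draft_tokens: (gamma,) proposed token ids from the draft model.
    draft_probs:  (gamma, vocab) draft-model distributions q at each proposed position.
    target_probs: (gamma + 1, vocab) target-model distributions p at the same positions,
                  plus one extra row used for the bonus token when everything is accepted.
    Returns the accepted tokens followed by one corrected or bonus token.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens.tolist()):
        p, q = target_probs[i, tok], draft_probs[i, tok]
        # Accept the draft token with probability min(1, p/q).
        if torch.rand(()) < torch.clamp(p / q, max=1.0):
            accepted.append(tok)
            continue
        # Rejected: resample from the residual distribution max(0, p - q), renormalized.
        residual = torch.clamp(target_probs[i] - draft_probs[i], min=0.0)
        accepted.append(torch.multinomial(residual / residual.sum(), 1).item())
        return accepted
    # All gamma draft tokens accepted: draw one bonus token from the target model.
    accepted.append(torch.multinomial(target_probs[-1], 1).item())
    return accepted
```

With the proposal length γ = 5 used in the paper, each verification call checks up to five draft tokens and always yields at least one target-model token, which is where the latency savings come from.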
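
The Open Datasets row lists public corpora, so a hedged sketch of pulling a few of them with the Hugging Face `datasets` library is included below. The hub IDs and splits are assumptions made for illustration, not identifiers confirmed by the paper; the linked repository provides its own data preparation.

```python
from datasets import load_dataset

# Hub IDs and splits below are illustrative assumptions; the paper's repository
# ships its own data-preparation scripts and may use different sources.
pretrain = load_dataset("DKYoon/SlimPajama-6B", split="train")   # pre-training corpus
gsm8k = load_dataset("gsm8k", "main", split="test")              # math-reasoning evaluation
spider = load_dataset("spider", split="validation")              # text-to-SQL evaluation

print(len(pretrain), len(gsm8k), len(spider))
```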
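
The Hardware Specification row mentions ZeRO-2/ZeRO-3 sharding with fp16. Assuming a DeepSpeed-style configuration (the framework is not named in the quoted text), a minimal ZeRO stage-2 config consistent with the quoted batch size and learning rate might look like the following; values not quoted in the table are placeholders.

```python
# Minimal DeepSpeed-style ZeRO stage-2 config (sketch). Only the effective batch
# size, learning rate, optimizer, and fp16 come from the quoted setup; the rest
# are placeholders.
ds_config = {
    "train_batch_size": 64,              # effective batch size (with accumulation)
    "gradient_accumulation_steps": 8,    # placeholder: 8 GPUs x micro-batch 1 x 8 steps = 64
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},   # the paper uses stage 3 for the 33B target model
    "optimizer": {"type": "AdamW", "params": {"lr": 5e-4}},
}
```

Such a dictionary would typically be passed to `deepspeed.initialize(model=..., config=ds_config)`; again, this is a framework assumption rather than something stated in the table.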
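
The Experiment Setup row pins down the optimizer, learning rate, effective batch size, epoch count, and proposal length. A minimal PyTorch sketch of that draft-model training loop is given below; the model and data loader are passed in as stand-ins, and anything not quoted (scheduler, clipping, warmup) is omitted.

```python
from torch.optim import AdamW

def train_draft_model(draft_model, train_loader,
                      lr=5e-4, effective_batch=64, micro_batch=8, num_epochs=1):
    """Sketch of the quoted draft-model training setup.

    Assumes a Hugging Face-style model whose forward pass returns an object with
    a .loss attribute and a loader that yields micro-batches of `micro_batch` examples.
    """
    accum_steps = effective_batch // micro_batch     # gradient accumulation up to batch size 64
    optimizer = AdamW(draft_model.parameters(), lr=lr)

    for _ in range(num_epochs):                      # one epoch on both pre-training and SFT data
        optimizer.zero_grad()
        for step, batch in enumerate(train_loader):
            loss = draft_model(**batch).loss / accum_steps
            loss.backward()
            if (step + 1) % accum_steps == 0:
                optimizer.step()
                optimizer.zero_grad()
```

The proposal length γ = 5 quoted in the same row is an inference-time setting (how many draft tokens are proposed before each verification step), so it does not appear in this training sketch.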