GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative Decoding
Authors: Cunxiao Du, Jing Jiang, Xu Yuanchen, Jiawei Wu, Sicheng Yu, Yongqi Li, Shenggui Li, Kai Xu, Liqiang Nie, Zhaopeng Tu, Yang You
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on different benchmarks demonstrate that our proposed GLIDE draft model significantly reduces the expected decoding latency. Additional evaluation using wall time reveals that GLIDE can accelerate Vicuna models up to 2.17x and further extend the improvement to 2.61x with CAPE. |
| Researcher Affiliation | Collaboration | Singapore Management University; National University of Singapore; The Hong Kong Polytechnic University; HPC-AI Tech; Harbin Institute of Technology (Shenzhen); Tencent AI Lab. |
| Pseudocode | Yes | See Algorithm 1 for the proposal expansion and Algorithm 2 for the verification. (A generic speculative-sampling verification sketch is given after the table.) |
| Open Source Code | Yes | We release our code, data, and the trained draft models at https://github.com/NonvolatileMemory/GliDe_with_a_CaPE_ICML_24. |
| Open Datasets | Yes | We first train our draft model on the pre-training dataset SlimPajama-6B (Soboleva et al., 2023). We then finetune the draft model on a supervised-finetuning (SFT) dataset (ShareGPT (GPT3.5 & 4, 2023) in our case) to further improve the model performance. Following Liu et al. (2023), we evaluate our GLIDE method across four different datasets: GSM8K (Cobbe et al., 2021) (math reasoning), Finance-Alpaca (Bharti, 2023) (QA for finance), Spider (Yu et al., 2018) (text-to-SQL), and Code-Search-Python (Husain et al., 2020) (Python code generation). We follow (Cai et al., 2023) and use the well-known benchmark dataset MT-Bench (Zheng et al., 2023) for the evaluation of CAPE. |
| Dataset Splits | No | The paper mentions training and finetuning datasets, but does not specify train/validation/test dataset splits with percentages or sample counts. |
| Hardware Specification | Yes | In the case of the 7B and 13B target models, we train GLIDE with zero2 and eight H800 GPUs. For the 33B target model, we use zero3 and 16 H800 GPUs. All the inference processes in this paper are performed using fp16 and on a single H800 GPU. |
| Software Dependencies | No | The paper mentions 'fp16' and 'AdamW (Kingma & Ba, 2015)' but does not provide specific version numbers for software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We set the batch size (with accumulation) to 64 and the learning rate to 5e-4, and use AdamW (Kingma & Ba, 2015) to optimize the draft model. We train the draft model for only one epoch on both the pretraining and SFT datasets. We set the proposal length γ to 5 and adopt speculative sampling as our acceptance strategy, following (Liu et al., 2023). (A training-configuration sketch follows the table.) |
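The Pseudocode row points to the paper's Algorithm 2 for verification, and the Experiment Setup row states that speculative sampling is the acceptance strategy. The sketch below shows only the standard speculative-sampling acceptance rule (Leviathan et al., 2023), not the paper's Algorithm 2 itself; the function name `speculative_verify`, the tensor layout, and the assumption that draft and target distributions are precomputed are illustrative assumptions.

```python
import torch

def speculative_verify(draft_tokens, q_probs, p_probs, generator=None):
    """Generic speculative-sampling acceptance (Leviathan et al., 2023) -- a sketch,
    not the paper's Algorithm 2.

    draft_tokens: LongTensor  [gamma]           tokens proposed by the draft model
    q_probs:      FloatTensor [gamma, vocab]    draft-model distribution at each proposal position
    p_probs:      FloatTensor [gamma+1, vocab]  target-model distribution at the same positions
    Returns the accepted prefix plus one corrected or bonus token.
    """
    gamma = draft_tokens.shape[0]
    accepted = []
    for i in range(gamma):
        tok = draft_tokens[i]
        p_i = p_probs[i, tok]
        q_i = q_probs[i, tok]
        # Accept the draft token with probability min(1, p/q).
        if torch.rand((), generator=generator) <= (p_i / q_i).clamp(max=1.0):
            accepted.append(tok)
        else:
            # Reject: resample from the renormalised residual distribution max(p - q, 0).
            residual = (p_probs[i] - q_probs[i]).clamp(min=0.0)
            residual = residual / residual.sum()
            corrected = torch.multinomial(residual, 1, generator=generator).squeeze(0)
            accepted.append(corrected)
            return torch.stack(accepted)
    # All gamma proposals accepted: draw one bonus token from the target model.
    bonus = torch.multinomial(p_probs[gamma], 1, generator=generator).squeeze(0)
    accepted.append(bonus)
    return torch.stack(accepted)
```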
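The Experiment Setup row lists the training hyperparameters (batch size 64 with accumulation, learning rate 5e-4, AdamW, one epoch, proposal length γ = 5). Below is a minimal PyTorch sketch of that configuration; the micro-batch size, the `train_draft_model` function, and the Hugging-Face-style `loss` output are assumptions, and the ZeRO-2/ZeRO-3 distributed training and fp16 setup described in the Hardware row are omitted.

```python
import torch
from torch.optim import AdamW

# Hyperparameters quoted in the Experiment Setup row; everything else
# (model construction, data loading, gradient-accumulation split) is an
# illustrative assumption, not the released training script.
LEARNING_RATE = 5e-4
EFFECTIVE_BATCH_SIZE = 64          # reached via gradient accumulation
MICRO_BATCH_SIZE = 8               # assumed per-step batch; the paper does not state it
ACCUM_STEPS = EFFECTIVE_BATCH_SIZE // MICRO_BATCH_SIZE
NUM_EPOCHS = 1                     # one epoch on both the pretraining and SFT data
PROPOSAL_LENGTH = 5                # gamma, used at inference time

def train_draft_model(draft_model, data_loader, device="cuda"):
    """Hypothetical single-GPU training loop matching the quoted hyperparameters."""
    optimizer = AdamW(draft_model.parameters(), lr=LEARNING_RATE)
    draft_model.train()
    for _ in range(NUM_EPOCHS):
        optimizer.zero_grad()
        for step, batch in enumerate(data_loader):
            batch = {k: v.to(device) for k, v in batch.items()}
            # Assumes a Hugging-Face-style forward pass returning .loss.
            loss = draft_model(**batch).loss / ACCUM_STEPS
            loss.backward()
            if (step + 1) % ACCUM_STEPS == 0:
                optimizer.step()
                optimizer.zero_grad()
```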