Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exiting
Authors: Fangcheng Liu, Yehui Tang, Zhenhua Liu, Yunsheng Ni, Duyu Tang, Kai Han, Yunhe Wang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on multiple benchmarks demonstrate our effectiveness, where Kangaroo achieves walltime speedups up to 2.04×, outperforming Medusa-1 with 88.7% fewer additional parameters. |
| Researcher Affiliation | Industry | Huawei Noah's Ark Lab; Consumer Business Group, Huawei {liufangcheng3,yehui.tang,kai.han,yunhe.wang}@huawei.com |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code for Kangaroo is available at https://github.com/Equationliu/Kangaroo. |
| Open Datasets | Yes | We conduct experiments on Vicuna [12] models with size of 7B and 13B. ... For Kangaroo, we train the adapter A for 10 epochs with the AdamW [42] optimizer on the ShareGPT dataset following Medusa [20]. |
| Dataset Splits | Yes | We evaluate the acceleration performance with the recently proposed Spec-Bench [22], which consists of six subtasks including Multi-turn Conversation, Translation, Summarization, Question Answering, Mathematical Reasoning and Retrieval-augmented Generation. |
| Hardware Specification | Yes | The training of the adapter A for Vicuna-7B takes around 24 hours on 8 NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper mentions the AdamW [42] optimizer but does not specify version numbers for software dependencies such as programming languages or libraries. |
| Experiment Setup | Yes | For Kangaroo, we train the adapter A for 10 epochs with the AdamW [42] optimizer on the ShareGPT dataset following Medusa [20]. ... During the inference stage, we set ℓ = 2 for Vicuna-7B and ℓ = 3 for Vicuna-13B. For the single-sequence decoding in Kangaroo, we set γ = 6 and η = 0.6. For the dynamic tree decoding scenario, we set Top-K as 10, and η = 0.4. (Hedged sketches of the training and decoding setups follow the table.) |
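To make the training row concrete, below is a minimal sketch of the adapter training loop: 10 epochs of AdamW, matching the quote. The adapter architecture, learning rate, dimensions, and the random placeholder batches (standing in for ShareGPT features) are all assumptions for illustration; the actual implementation is in the linked repository.

```python
import torch
from torch.nn import functional as F
from torch.optim import AdamW

D_MODEL, VOCAB = 4096, 32000  # Vicuna-7B hidden size and vocab (assumed here)

# Hypothetical adapter A: a single attention block bridging the shallow
# sub-network's hidden states to the frozen LM head. The real architecture
# is defined in the Kangaroo repository.
adapter = torch.nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=32,
                                           batch_first=True)
lm_head = torch.nn.Linear(D_MODEL, VOCAB, bias=False)  # frozen, shared with the LLM
lm_head.requires_grad_(False)

optimizer = AdamW(adapter.parameters(), lr=2e-4)  # lr is an assumption

# Placeholder batches standing in for ShareGPT features: shallow hidden
# states paired with the target model's next-token ids.
def fake_batches(n=4, seq=64):
    for _ in range(n):
        yield torch.randn(1, seq, D_MODEL), torch.randint(0, VOCAB, (1, seq))

for epoch in range(10):  # "train the adapter A for 10 epochs"
    for hidden, targets in fake_batches():
        logits = lm_head(adapter(hidden))
        loss = F.cross_entropy(logits.view(-1, VOCAB), targets.view(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```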
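For the inference hyperparameters, the following is a minimal sketch of Kangaroo-style single-sequence self-speculative decoding with double early exiting: the draft model is the target LLM's first ℓ layers plus the adapter A (the first exit), and drafting stops once the draft confidence falls below η or γ tokens have been drafted (the second exit). The callables `shallow_forward`, `adapter_head`, and `full_forward` are hypothetical stand-ins, not the repo's actual API; the defaults follow the γ = 6 and η = 0.6 reported above.

```python
import torch

def kangaroo_decode(shallow_forward, adapter_head, full_forward, prompt_ids,
                    max_new_tokens=128, gamma=6, eta=0.6):
    """Sketch of single-sequence self-speculative decoding a la Kangaroo.

    Hypothetical callables (stand-ins, not the repo's real API):
      shallow_forward(ids): hidden states of the LLM's first l layers
      adapter_head(h):      adapter A mapping hidden states to vocab logits
      full_forward(ids):    full-model logits for every position of ids
    """
    ids = prompt_ids
    produced = 0
    while produced < max_new_tokens:
        # Drafting: self-draft with the shallow sub-network plus adapter,
        # exiting early once confidence drops below eta (the second early
        # exit) or gamma draft tokens have been produced.
        draft_ids, n_draft = ids, 0
        while n_draft < gamma:
            logits = adapter_head(shallow_forward(draft_ids))[:, -1, :]
            conf, tok = torch.softmax(logits, dim=-1).max(dim=-1)
            if conf.item() < eta:
                break
            draft_ids = torch.cat([draft_ids, tok.view(1, 1)], dim=-1)
            n_draft += 1
        # Verification: one parallel pass of the full model over the draft.
        # Greedy targets for the last n_draft + 1 positions cover every
        # drafted token plus one bonus/correction token.
        targets = full_forward(draft_ids)[0, -(n_draft + 1):, :].argmax(dim=-1)
        n_accept = 0
        while n_accept < n_draft and \
                draft_ids[0, ids.shape[-1] + n_accept] == targets[n_accept]:
            n_accept += 1
        # Keep the accepted draft tokens plus the full model's next token,
        # so the output matches plain greedy decoding exactly (lossless).
        ids = torch.cat([draft_ids[:, :ids.shape[-1] + n_accept],
                         targets[n_accept].view(1, 1)], dim=-1)
        produced += n_accept + 1
    return ids
```

In the dynamic tree decoding scenario reported above, the drafter instead expands Top-K = 10 candidates per step into a token tree gated by η = 0.4; that variant is omitted here for brevity.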