Sequoia: Scalable and Robust Speculative Decoding
Authors: Zhuoming Chen, Avner May, Ruslan Svirschevski, Yu-Hsun Huang, Max Ryabinin, Zhihao Jia, Beidi Chen
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In Section 4, we perform extensive end-to-end experiments and ablation studies to demonstrate the effectiveness of SEQUOIA. We implement SEQUOIA on top of Hugging Face [45] with CUDA Graphs [31, 32]. We show that SEQUOIA achieves up to 4.04× speedup for Llama2-7B on a single A100 GPU and 9.5× for Llama3-70B-Instruct in the offloading setting on an L40 GPU. |
| Researcher Affiliation | Collaboration | Zhuoming Chen (1), Avner May (2), Ruslan Svirschevski (3,4), Yuhsun Huang (1), Max Ryabinin (2), Zhihao Jia (1), Beidi Chen (1,5); affiliations: (1) Carnegie Mellon University, (2) Together AI, (3) Yandex, (4) National Research University Higher School of Economics, (5) FAIR, Meta |
| Pseudocode | Yes | Algorithm 1 SEQUOIA Dynamic program... Algorithm 2 SEQUOIA Sampling and Verification. (A sketch of the tree-construction dynamic program follows the table.) |
| Open Source Code | Yes | The code is available at https://github.com/Infini-AI-Lab/Sequoia. |
| Open Datasets | Yes | We evaluate our results on the C4 (en) [35] validation dataset, OpenWebText [14], CNN DailyMail [36], and MT-Bench [52]. |
| Dataset Splits | No | The paper describes using data for measuring acceptance rates and evaluation, but does not specify traditional train/validation/test dataset splits for its own experimental setup, as its focus is on inference acceleration rather than model training. |
| Hardware Specification | Yes | We show that SEQUOIA achieves up to 4.04× speedup for Llama2-7B on a single A100 GPU and 9.5× for Llama3-70B-Instruct in the offloading setting on an L40 GPU... We evaluate SEQUOIA on different hardware, including on-device experiments on L40 and A100 (PCIe, 80GB) GPUs, as well as offloading experiments on an L40 GPU (with PCIe 4.0). |
| Software Dependencies | Yes | We implement SEQUOIA on top of Hugging Face [45] with CUDA Graphs [31, 32]... To accelerate sampling without replacement, which is not efficient in PyTorch 2.1 [32], we use the exponential-sort algorithm [44], combined with PyTorch CUDA graphs [31, 32]. (A hedged sketch of this trick follows the table.) |
| Experiment Setup | Yes | The prompt length and generation length are both set to 128 tokens, except for MT-Bench. We evaluate SEQUOIA on different hardware, including on-device experiments on L40 and A100 (PCIe, 80GB) GPUs, as well as offloading experiments on an L40 GPU (with PCIe 4.0). |
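
To make the Pseudocode row concrete: the paper's Algorithm 1 is a dynamic program that picks the speculation-tree shape maximizing the expected number of accepted tokens. The sketch below is a minimal reconstruction under one simplifying assumption: each node's k-th child is accepted with a position-dependent probability `p[k]` (estimated offline, sorted in descending order), so a tree with child subtrees T_1..T_k has value F(T) = 1 + Σ_j p[j]·F(T_j), which yields the optimal substructure the DP exploits. The names `optimal_tree_value`, `best`, and `tree` are illustrative, not the repository's API, and the paper's depth bound and hardware-aware budget selection are omitted here.

```python
from functools import lru_cache

def optimal_tree_value(p, budget):
    """Expected accepted tokens of the best speculation tree with `budget` nodes.

    p: tuple of per-child-position acceptance probabilities, descending.
    """
    @lru_cache(maxsize=None)
    def best(n, k):
        # Best total value of child subtrees using child slots 1..k and n nodes.
        if n == 0 or k == 0:
            return 0.0
        value = best(n, k - 1)  # option: leave slot k empty
        for m in range(1, n + 1):
            # option: give slot k a subtree of m nodes, rest to earlier slots
            value = max(value, p[k - 1] * tree(m) + best(n - m, k - 1))
        return value

    @lru_cache(maxsize=None)
    def tree(n):
        # Best tree with exactly n nodes: the root contributes 1 token,
        # the remaining n-1 nodes are split optimally among child slots.
        if n == 0:
            return 0.0
        return 1.0 + best(n - 1, len(p))

    return tree(budget)

# e.g. optimal_tree_value((0.8, 0.5, 0.3), budget=8)
```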
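
The exponential-sort algorithm [44] mentioned in the Software Dependencies row also admits a compact sketch. The idea: drawing keys E_i/p_i with E_i ~ Exp(1) and taking the k smallest is equivalent to sampling k indices without replacement with probability proportional to p, and the resulting kernel (`exponential_` + `topk`) is branch-free, so it can be captured in a CUDA graph. `sample_without_replacement` is a hypothetical helper name, not Sequoia's actual function.

```python
import torch

def sample_without_replacement(probs: torch.Tensor, k: int) -> torch.Tensor:
    """Draw k distinct indices with probability proportional to `probs`."""
    # Exponential race: zero-probability entries get an infinite key and
    # are therefore never selected while positive-probability entries remain.
    keys = torch.empty_like(probs).exponential_() / probs
    return torch.topk(keys, k, largest=False).indices

# e.g. sample_without_replacement(torch.softmax(logits, dim=-1), k=4)
```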