Sequoia: Scalable and Robust Speculative Decoding

Authors: Zhuoming Chen, Avner May, Ruslan Svirschevski, Yu-Hsun Huang, Max Ryabinin, Zhihao Jia, Beidi Chen

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In Section 4, we perform extensive end-to-end experiments and ablation studies to demonstrate the effectiveness of SEQUOIA. We implement SEQUOIA on top of Hugging Face [45] with CUDA Graphs [31, 32]. We show that SEQUOIA achieves up to 4.04× speedup for Llama2-7B on a single A100 GPU and 9.5× for Llama3-70B-Instruct in the offloading setting on an L40 GPU.
Researcher Affiliation | Collaboration | Zhuoming Chen (1), Avner May (2), Ruslan Svirschevski (3,4), Yu-Hsun Huang (1), Max Ryabinin (2), Zhihao Jia (1), Beidi Chen (1,5); 1: Carnegie Mellon University, 2: Together AI, 3: Yandex, 4: National Research University Higher School of Economics, 5: FAIR, Meta
Pseudocode | Yes | Algorithm 1: SEQUOIA Dynamic program...; Algorithm 2: SEQUOIA Sampling and Verification (a generic accept/reject sketch follows this table)
Open Source Code | Yes | The code is available at https://github.com/Infini-AI-Lab/Sequoia.
Open Datasets | Yes | We evaluate our results on the C4 (en) [35] validation dataset, OpenWebText [14], CNN DailyMail [36], and MT-Bench [52].
Dataset Splits | No | The paper describes using data to measure acceptance rates and to evaluate speedups, but it does not specify traditional train/validation/test splits for its own experimental setup, since its focus is inference acceleration rather than model training.
Hardware Specification | Yes | We show that SEQUOIA achieves up to 4.04× speedup for Llama2-7B on a single A100 GPU and 9.5× for Llama3-70B-Instruct in the offloading setting on an L40 GPU... We evaluate SEQUOIA on different hardware, including on-device experiments on L40 and A100 (PCIe, 80GB) GPUs, as well as offloading experiments on an L40 GPU (with PCIe 4.0).
Software Dependencies | Yes | We implement SEQUOIA on top of Hugging Face [45] with CUDA Graphs [31, 32]... To accelerate sampling without replacement, which is not efficient in PyTorch 2.1 [32], we use the exponential-sort algorithm [44], combined with PyTorch CUDA graphs [31, 32]. (Sketches of the exponential-sort trick and of CUDA-graph capture follow this table.)
Experiment Setup | Yes | The prompt length and generation length are both set to 128 tokens, except for MT-Bench. We evaluate SEQUOIA on different hardware, including on-device experiments on L40 and A100 (PCIe, 80GB) GPUs, as well as offloading experiments on an L40 GPU (with PCIe 4.0). (A timing-harness sketch follows this table.)
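
The pseudocode row cites SEQUOIA's two algorithms: a dynamic program for building the speculation tree and a sampling-and-verification procedure. For orientation only, the sketch below shows the classic single-token speculative-sampling accept/reject rule (accept draft token x with probability min(1, p(x)/q(x)); on rejection, resample from the normalized residual). SEQUOIA's Algorithm 2 generalizes this rule to token trees verified with sampling without replacement; the function name and signature here are ours, not the paper's.

```python
import torch

def verify_draft_token(p, q, x, generator=None):
    """Classic speculative-sampling verification for one draft token.

    p: target-model probabilities over the vocabulary (1D tensor)
    q: draft-model probabilities over the vocabulary (1D tensor)
    x: token id proposed by the draft model
    Returns (accepted, token). Assumes p != q so the residual is
    nonzero; SEQUOIA's Algorithm 2 extends this rule to trees.
    """
    # Accept x with probability min(1, p(x) / q(x)).
    u = torch.rand((), generator=generator)
    if u < torch.clamp(p[x] / q[x], max=1.0):
        return True, x
    # On rejection, sample from the normalized residual (p - q)^+,
    # which preserves the target distribution exactly.
    residual = torch.clamp(p - q, min=0.0)
    return False, torch.multinomial(residual / residual.sum(), 1,
                                    generator=generator).item()
```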
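
The exponential-sort algorithm cited in the Software Dependencies row is the standard one-pass trick for sampling without replacement: draw E_i ~ Exp(1) independently and keep the indices of the k smallest values of E_i / p_i; those indices are distributed as k categorical draws without replacement. A minimal PyTorch sketch (the function name is ours):

```python
import torch

def sample_without_replacement(probs, k, generator=None):
    """Exponential-sort trick: k categorical draws without replacement.

    probs: 1D tensor of probabilities (zeros map to infinite keys and
    are never selected). Runs as a single vectorized top-k, which is
    why it composes well with CUDA graphs.
    """
    noise = torch.empty_like(probs).exponential_(generator=generator)
    return torch.topk(noise / probs, k, largest=False).indices
```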
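
The CUDA Graphs dependency refers to PyTorch's capture-and-replay mechanism, which removes per-step kernel-launch overhead; this matters because each speculation step launches many small kernels. Below is the documented PyTorch capture/replay pattern with a placeholder linear layer standing in for the real decoding step; it is a sketch, not Sequoia's actual integration:

```python
import torch

# Placeholder model; Sequoia captures its actual decoding step instead.
model = torch.nn.Linear(4096, 4096).cuda().eval()
static_input = torch.zeros(1, 4096, device="cuda")

with torch.no_grad():
    # Warm up on a side stream so capture starts from a steady state.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            model(static_input)
    torch.cuda.current_stream().wait_stream(s)

    # Capture one forward pass into a graph.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_output = model(static_input)

# Replay with new data by writing into the captured input buffer;
# replay() reruns all captured kernels with near-zero launch cost.
static_input.copy_(torch.randn(1, 4096, device="cuda"))
graph.replay()  # static_output now holds the result for the new input
```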
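
Finally, the speedups quoted above are per-token latency ratios between standard autoregressive decoding and SEQUOIA. A hypothetical measurement harness under the quoted setup (128-token prompt and generation); the generate callables are placeholders, not the paper's benchmark code:

```python
import time
import torch

def tokens_per_second(generate, prompt, n_tokens=128, warmup=1):
    """Time a generation callable; `generate` is a placeholder for
    either baseline decoding or SEQUOIA decoding."""
    for _ in range(warmup):
        generate(prompt, n_tokens)
    torch.cuda.synchronize()  # ensure queued GPU work has finished
    start = time.perf_counter()
    generate(prompt, n_tokens)
    torch.cuda.synchronize()
    return n_tokens / (time.perf_counter() - start)

# speedup = tokens_per_second(sequoia_generate, prompt) /
#           tokens_per_second(baseline_generate, prompt)
```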