SpecTr: Fast Speculative Decoding via Optimal Transport

Authors: Ziteng Sun, Ananda Theertha Suresh, Jae Hun Ro, Ahmad Beirami, Himanshu Jain, Felix Yu

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experimentally demonstrate that for state-of-the-art large language models, the proposed approach achieves a wall clock speedup of 2.13X, a further 1.37X speedup over speculative decoding on standard benchmarks.
Researcher Affiliation | Industry | Ziteng Sun (Google Research, New York; zitengsun@google.com), Ananda Theertha Suresh (Google Research, New York; theertha@google.com), Jae Hun Ro (Google Research, New York; jaero@google.com), Ahmad Beirami (Google Research, New York; beirami@google.com), Himanshu Jain (Google Research, New York; himj@google.com), Felix Yu (Google Research, New York; felixyu@google.com)
Pseudocode | Yes | Algorithm 1: token-level maximal coupling; Algorithm 2: k-sequential selection algorithm (K-SEQ); Algorithm 3: draft selection with multiple candidates (Draft Selection). A minimal sketch of the token-level coupling rule appears after this table.
Open Source Code | No | The paper neither provides a link to open-source code for its described methodology nor states that the code will be released.
Open Datasets | Yes | Experiments use the One Billion Word benchmark (LM1B) [3]. In Appendix E, we use a pair of smaller transformer models to break down different affecting factors mentioned above. In Table 1, we use PALM-2-Gecko and PALM-2-Bison as the small model and large model, respectively [13, 12]. The wall clock speedup is normalized by the wall clock latency of baseline autoregressive decoding.
Dataset Splits | No | The paper mentions training on the LM1B dataset and refers to 'test prompts', but it does not give the train, validation, and test splits (e.g., percentages or exact counts) needed for reproduction.
Hardware Specification | No | The paper mentions running experiments on 'TPUs and GPUs' or 'on GPU' but does not specify exact hardware models, such as particular GPU or CPU types (e.g., 'NVIDIA A100' or 'Intel Xeon').
Software Dependencies | No | Appendix E mentions the 'FLAX library [15]' being used for training, but no version number is given for FLAX or any other software dependency.
Experiment Setup | Yes | All results are over 1000 test prompts, averaged over three different random seeds, with a sampling temperature of 1.0 for both the draft and large models.
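The Pseudocode row names Algorithm 1, token-level maximal coupling, which is the acceptance rule that speculative decoding applies to each draft token. Since no official code is released, below is a minimal sketch of that standard coupling rule under common assumptions; the function name, array layout, and numerical guard are illustrative and not taken from the paper.

```python
import numpy as np

def maximal_coupling_step(draft_token, p_draft, p_target, rng):
    """One token-level maximal coupling step (illustrative sketch).

    draft_token : token id sampled from the draft distribution p_draft.
    p_draft, p_target : 1-D arrays over the vocabulary, each summing to 1.
    Returns (token, accepted); the returned token is distributed according to
    p_target, and it equals draft_token as often as any coupling allows.
    """
    # Accept the draft token x with probability min(1, p_target(x) / p_draft(x)).
    accept_prob = min(1.0, p_target[draft_token] / max(p_draft[draft_token], 1e-12))
    if rng.random() < accept_prob:
        return draft_token, True
    # On rejection, resample from the normalized residual (p_target - p_draft)^+.
    residual = np.maximum(p_target - p_draft, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p_target), p=residual)), False

# Tiny usage example with a 3-token vocabulary.
rng = np.random.default_rng(0)
p_draft = np.array([0.6, 0.3, 0.1])
p_target = np.array([0.5, 0.2, 0.3])
draft_token = int(rng.choice(3, p=p_draft))
token, accepted = maximal_coupling_step(draft_token, p_draft, p_target, rng)
```

The paper's K-SEQ and Draft Selection algorithms generalize this single-draft rule to multiple draft candidates via optimal transport; they are not sketched here because their details depend on the paper's specific formulation.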