SpecTr: Fast Speculative Decoding via Optimal Transport

Authors: Ziteng Sun, Ananda Theertha Suresh, Jae Hun Ro, Ahmad Beirami, Himanshu Jain, Felix Yu

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experimentally demonstrate that for state-of-the-art large language models, the proposed approach achieves a wall clock speedup of 2.13X, a further 1.37X speedup over speculative decoding on standard benchmarks.
Researcher Affiliation | Industry | Ziteng Sun (Google Research, New York; zitengsun@google.com), Ananda Theertha Suresh (Google Research, New York; theertha@google.com), Jae Hun Ro (Google Research, New York; jaero@google.com), Ahmad Beirami (Google Research, New York; beirami@google.com), Himanshu Jain (Google Research, New York; himj@google.com), Felix Yu (Google Research, New York; felixyu@google.com)
Pseudocode | Yes | Algorithm 1: token-level maximal coupling; Algorithm 2: k-sequential selection algorithm (K-SEQ); Algorithm 3: draft selection with multiple candidates (Draft Selection). A minimal sketch of the token-level coupling rule appears after this table.
Open Source Code | No | The paper neither provides a link to open-source code for its described methodology nor states that the code will be released.
Open Datasets | Yes | Experiments use the One Billion Word benchmark (LM1B) [3]. In Appendix E, we use a pair of smaller transformer models to break down different affecting factors mentioned above. In Table 1, we use PALM-2-Gecko and PALM-2-Bison as the small model and large model, respectively [13, 12]. The wall clock speedup is normalized by the wall clock latency of baseline autoregressive decoding.
Dataset Splits | No | The paper mentions training on the LM1B dataset and refers to 'test prompts', but it does not give the train, validation, and test splits (e.g., percentages or exact counts) needed for reproduction.
Hardware Specification | No | The paper mentions running experiments on 'TPUs and GPUs' or 'on GPU' but does not specify exact hardware models, such as particular GPU or CPU types (e.g., 'NVIDIA A100' or 'Intel Xeon').
Software Dependencies | No | Appendix E mentions the 'FLAX library [15]' being used for training, but no version number is given for FLAX or any other software dependency.
Experiment Setup | Yes | All results are over 1000 test prompts, averaged over three different random seeds, with a sampling temperature of 1.0 for both the draft and large models.
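The Pseudocode row names Algorithm 1, token-level maximal coupling, which is the acceptance rule that speculative decoding applies to each draft token. Since no official code is released, below is a minimal sketch of that standard coupling rule under common assumptions; the function name, array layout, and numerical guard are illustrative and not taken from the paper.

```python
import numpy as np

def maximal_coupling_step(draft_token, p_draft, p_target, rng):
    """One token-level maximal coupling step (illustrative sketch).

    draft_token : token id sampled from the draft distribution p_draft.
    p_draft, p_target : 1-D arrays over the vocabulary, each summing to 1.
    Returns (token, accepted); the returned token is distributed according to
    p_target, and it equals draft_token as often as any coupling allows.
    """
    # Accept the draft token x with probability min(1, p_target(x) / p_draft(x)).
    accept_prob = min(1.0, p_target[draft_token] / max(p_draft[draft_token], 1e-12))
    if rng.random() < accept_prob:
        return draft_token, True
    # On rejection, resample from the normalized residual (p_target - p_draft)^+.
    residual = np.maximum(p_target - p_draft, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p_target), p=residual)), False

# Tiny usage example with a 3-token vocabulary.
rng = np.random.default_rng(0)
p_draft = np.array([0.6, 0.3, 0.1])
p_target = np.array([0.5, 0.2, 0.3])
draft_token = int(rng.choice(3, p=p_draft))
token, accepted = maximal_coupling_step(draft_token, p_draft, p_target, rng)
```

The paper's K-SEQ and Draft Selection algorithms generalize this single-draft rule to multiple draft candidates via optimal transport; they are not sketched here because their details depend on the paper's specific formulation.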