Fast Inference from Transformers via Speculative Decoding

Authors: Yaniv Leviathan, Matan Kalman, Yossi Matias

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate it on T5-XXL and show a 2X-3X acceleration compared to the standard T5X implementation, with identical outputs." (Abstract; Section 4, Experiments)
Researcher Affiliation | Industry | "Google Research, Mountain View, CA, USA. Correspondence to: Yaniv Leviathan <leviathan@google.com>."
Pseudocode | Yes | "Algorithm 1: Speculative Decoding Step" (a sketch of the step appears below the table)
Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology, nor a link to a code repository.
Open Datasets | Yes | "We test a standard encoder-decoder T5 version 1.1 model (Raffel et al., 2020) on two tasks from the T5 paper: (1) English to German translation fine-tuned on WMT En-De, and (2) text summarization fine-tuned on CNN/DM." and "trained on lm1b (Chelba et al., 2013)." (a loading sketch appears below the table)
Dataset Splits | No | The paper mentions fine-tuning on WMT En-De, CNN/DM, and LM1B, but it does not provide explicit training, validation, and test splits (e.g., percentages or sample counts), nor does it cite predefined splits for reproducibility.
Hardware Specification | Yes | "We measure walltime improvements with a batch size of 1 on a single TPU-v4 for both argmax sampling (temp=0) and standard sampling (temp=1)."
Software Dependencies | No | The paper mentions using the T5X implementation and references the T5 version 1.1 model and BERT tokenization, but it does not provide version numbers for ancillary software dependencies such as programming languages, libraries (e.g., TensorFlow, PyTorch), or CUDA.
Experiment Setup | Yes | "We test a standard encoder-decoder T5 version 1.1 model (Raffel et al., 2020) on two tasks from the T5 paper: (1) English to German translation fine-tuned on WMT En-De, and (2) text summarization fine-tuned on CNN/DM. For both tasks, we use T5-XXL (11B) for Mp. For the approximation model Mq we test several existing configurations, namely T5-large (800M), T5-base (250M), and T5-small (77M) (Raffel et al., 2020). We use existing checkpoints for all models. We measure walltime improvements with a batch size of 1 on a single TPU-v4 for both argmax sampling (temp=0) and standard sampling (temp=1)." (an approximate reproduction sketch appears below the table)
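
The "Pseudocode" row refers to the paper's Algorithm 1 (a single speculative decoding step). Below is a minimal NumPy sketch of that step under stated assumptions: `p_model` and `q_model` are hypothetical stand-ins for the target model Mp and the approximation model Mq, each returning a probability vector over the vocabulary for a given token prefix, and `gamma` is the number of drafted tokens. It illustrates the accept/reject rule only; it is not the paper's T5X implementation.

import numpy as np

def speculative_decoding_step(prefix, p_model, q_model, gamma, rng):
    """One speculative decoding step (sketch of the paper's Algorithm 1).

    p_model(prefix) -> next-token distribution under the target model Mp.
    q_model(prefix) -> next-token distribution under the draft model Mq.
    Both are assumed to return 1-D probability vectors over the vocabulary.
    """
    # 1) Draft gamma tokens autoregressively from the cheap model Mq.
    drafted, q_dists = [], []
    ctx = list(prefix)
    for _ in range(gamma):
        q = q_model(ctx)
        x = rng.choice(len(q), p=q)
        q_dists.append(q)
        drafted.append(x)
        ctx.append(x)

    # 2) Score all gamma+1 prefixes with the target model Mp.
    #    (In the real system this is a single parallel forward pass.)
    p_dists = [p_model(list(prefix) + drafted[:i]) for i in range(gamma + 1)]

    # 3) Accept each drafted token x_i with probability min(1, p(x_i) / q(x_i)).
    n_accepted = gamma
    for i, x in enumerate(drafted):
        if rng.random() > min(1.0, p_dists[i][x] / q_dists[i][x]):
            n_accepted = i
            break

    if n_accepted < gamma:
        # 4a) On the first rejection, resample from norm(max(0, p - q)).
        residual = np.maximum(p_dists[n_accepted] - q_dists[n_accepted], 0.0)
        residual /= residual.sum()
        extra = rng.choice(len(residual), p=residual)
    else:
        # 4b) All drafts accepted: sample one bonus token from Mp.
        extra = rng.choice(len(p_dists[gamma]), p=p_dists[gamma])

    return list(prefix) + drafted[:n_accepted] + [extra]

Looping this step until an end-of-sequence token appears yields the same output distribution as sampling directly from Mp, while typically emitting several tokens per call to Mp; that is the source of the 2X-3X walltime gain reported in the paper.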
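
The datasets named in the "Open Datasets" row (WMT En-De, CNN/DM, LM1B) are publicly available. The sketch below pulls them via the Hugging Face `datasets` library; the paper does not specify this loading path, and the exact Hub names and configurations (`wmt14`/`de-en`, `cnn_dailymail`/`3.0.0`, `lm1b`) are assumptions here.

from datasets import load_dataset

# English-German translation pairs (the WMT En-De task; WMT14 is assumed here).
wmt_en_de = load_dataset("wmt14", "de-en", split="test")

# CNN/DailyMail summarization ("3.0.0" is the commonly used configuration).
cnn_dm = load_dataset("cnn_dailymail", "3.0.0", split="test")

# One Billion Word benchmark (LM1B), used for the paper's GPT-like decoder experiments.
lm1b = load_dataset("lm1b", split="test")

print(wmt_en_de[0]["translation"])   # {'de': ..., 'en': ...}
print(cnn_dm[0]["article"][:200])    # first characters of a source article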
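
The "Experiment Setup" row pairs a T5-XXL target (Mp) with smaller T5 drafts (Mq) under T5X on a TPU-v4. That stack is not reproduced here, but Hugging Face transformers' assisted generation implements the same target/draft idea, so a rough approximation of the pairing might look like the sketch below. Assumptions: a transformers version with `assistant_model` support, the public `google/t5-v1_1-xxl` and `google/t5-v1_1-small` checkpoints (not the paper's WMT/CNN-DM fine-tuned ones), and greedy decoding as the counterpart of the temp=0 condition.

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

device = "cuda" if torch.cuda.is_available() else "cpu"

# Target model Mp (the paper uses T5-XXL, 11B) and draft model Mq (here T5-small, 77M).
# These public T5 v1.1 checkpoints are not fine-tuned, so output quality is illustrative only.
tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-xxl")
target = AutoModelForSeq2SeqLM.from_pretrained(
    "google/t5-v1_1-xxl", torch_dtype=torch.bfloat16
).to(device)
draft = AutoModelForSeq2SeqLM.from_pretrained(
    "google/t5-v1_1-small", torch_dtype=torch.bfloat16
).to(device)

inputs = tokenizer(
    "translate English to German: The house is wonderful.", return_tensors="pt"
).to(device)

# Batch size 1, greedy decoding (the paper's temp=0 setting); the draft model proposes
# tokens that the target model verifies, as in speculative decoding.
out = target.generate(**inputs, assistant_model=draft, do_sample=False, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))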