Fast Inference from Transformers via Speculative Decoding
Authors: Yaniv Leviathan, Matan Kalman, Yossi Matias
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate it on T5-XXL and show a 2X-3X acceleration compared to the standard T5X implementation, with identical outputs. (see also Section 4, Experiments) |
| Researcher Affiliation | Industry | 1Google Research, Mountain View, CA, USA. Correspondence to: Yaniv Leviathan <leviathan@google.com>. |
| Pseudocode | Yes | Algorithm 1 Speculative Decoding Step (a sketch of this step appears after the table) |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology or a link to a code repository. |
| Open Datasets | Yes | We test a standard encoder-decoder T5 version 1.1 model (Raffel et al., 2020) on two tasks from the T5 paper: (1) English to German translation fine tuned on WMT En De, and (2) Text summarization fine tuned on CNN/DM. and trained on lm1b (Chelba et al., 2013). |
| Dataset Splits | No | The paper mentions fine-tuning on WMT En De, CNN/DM, and lm1b datasets, but it does not provide explicit training, validation, and test dataset splits (e.g., percentages or sample counts), nor does it reference predefined splits with specific citations for reproducibility. |
| Hardware Specification | Yes | We measure walltime improvements with a batch size of 1 on a single TPU-v4 for both argmax sampling (temp=0) and standard sampling (temp=1). |
| Software Dependencies | No | The paper mentions using the 'T5X implementation' and references 'T5 version 1.1 model' and 'Bert tokenization'. However, it does not provide specific version numbers for ancillary software dependencies such as programming languages, libraries (e.g., TensorFlow, PyTorch), or CUDA. |
| Experiment Setup | Yes | We test a standard encoder-decoder T5 version 1.1 model (Raffel et al., 2020) on two tasks from the T5 paper: (1) English to German translation fine tuned on WMT En De, and (2) Text summarization fine tuned on CNN/DM. For both tasks, we use T5-XXL (11B) for M_p. For the approximation model M_q we test several existing configurations, namely T5-large (800M), T5-base (250M), and T5-small (77M) (Raffel et al., 2020). We use existing checkpoints for all models. We measure walltime improvements with a batch size of 1 on a single TPU-v4 for both argmax sampling (temp=0) and standard sampling (temp=1). |
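
For context on the "Pseudocode" entry above, here is a minimal, illustrative Python sketch of a single speculative decoding step in the spirit of Algorithm 1. It is not the authors' implementation (per the table, no source code is released); `target_probs`, `draft_probs`, and the lookahead length `gamma` are hypothetical stand-ins for the target model M_p, the approximation model M_q, and the paper's γ parameter.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(dist):
    """Draw a token id from a 1-D probability vector."""
    return int(rng.choice(len(dist), p=dist))

def speculative_step(prefix, target_probs, draft_probs, gamma=5):
    """One speculative decoding step: draft with M_q, verify with M_p.

    `prefix` is a list of token ids; `target_probs`/`draft_probs` map a
    token list to a next-token probability vector (hypothetical interfaces).
    """
    # 1) Autoregressively draft gamma candidate tokens with the cheap model M_q.
    drafted, q_dists = [], []
    for _ in range(gamma):
        q = draft_probs(prefix + drafted)
        q_dists.append(q)
        drafted.append(sample(q))

    # 2) Score positions 1..gamma+1 with the target model M_p
    #    (in the paper this is a single parallel forward pass).
    p_dists = [target_probs(prefix + drafted[:i]) for i in range(gamma + 1)]

    # 3) Accept drafted token x_i with probability min(1, p_i(x_i) / q_i(x_i)).
    accepted = []
    for i, x in enumerate(drafted):
        p_i, q_i = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p_i[x] / q_i[x]):
            accepted.append(x)
        else:
            # First rejection: resample from the residual norm(max(0, p_i - q_i))
            # and stop, so the output distribution matches M_p exactly.
            residual = np.maximum(p_i - q_i, 0.0)
            accepted.append(sample(residual / residual.sum()))
            return prefix + accepted

    # 4) All gamma drafts accepted: sample one extra token from p_{gamma+1}.
    accepted.append(sample(p_dists[gamma]))
    return prefix + accepted
```

Each call appends between 1 and gamma+1 newly generated tokens to the prefix while invoking the target model only once in parallel, which is the source of the walltime gains the paper reports.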