Speculative Decoding with Big Little Decoder
Authors: Sehoon Kim, Karttikeya Mangalam, Suhong Moon, Jitendra Malik, Michael W. Mahoney, Amir Gholami, Kurt Keutzer
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate our framework across different tasks and models, we apply BiLD to various text generation scenarios encompassing machine translation on IWSLT 2017 De-En and WMT 2014 De-En, and summarization on XSUM and CNN/DailyMail. On an NVIDIA T4 GPU, our framework achieves a speedup of up to 2.12× with minimal generation quality degradation. |
| Researcher Affiliation | Academia | ¹University of California, Berkeley ²ICSI ³LBNL |
| Pseudocode | Yes | Algorithm 1: Big Little Decoder (a sketch of the fallback/rollback loop appears below the table) |
| Open Source Code | Yes | Our code is open-sourced [1]. [1] https://github.com/kssteven418/BigLittleDecoder |
| Open Datasets | Yes | To evaluate our framework across different tasks and models, we apply BiLD to various text generation scenarios encompassing machine translation on IWSLT 2017 De-En and WMT 2014 De-En, and summarization on XSUM and CNN/DailyMail. |
| Dataset Splits | Yes | Figure 2 plots the text generation quality on the validation dataset of each benchmark for different proportions of the large model's engagement. |
| Hardware Specification | Yes | On an NVIDIA T4 GPU, our framework achieves a speedup of up to 2.12× with minimal generation quality degradation. [...] All inference evaluations are conducted on a single NVIDIA T4 GPU of a GCP n1-standard-4 instance, using a batch size of 1, which is a common use case for online serving [51]. |
| Software Dependencies | No | The paper states: 'Our framework is built on top of PyTorch [45] and the Hugging Face Transformers library [73] along with their pre-trained checkpoints.' However, it does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | All the models are fine-tuned from the pre-trained checkpoints of the Hugging Face library [73] for 500k steps using a batch size of 16. We use the Adafactor optimizer [53] with a constant learning rate of {0.5, 1, 2, 5}e-4 for the small models and {0.5, 1}e-4 for the large models. [...] For the machine translation tasks, we use fallback thresholds in [0.5, 0.9] and rollback thresholds in [1, 10]. For the summarization tasks, fallback thresholds in [0.2, 0.6] and rollback thresholds in [2, 6]. We keep the maximum generation length of the small model at 10 to avoid high rollback costs. |
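
The pseudocode and experiment-setup rows above describe the core of the method: a small model drafts tokens until its confidence drops below a fallback threshold, and the large model then verifies the draft and rolls back tokens whose distance from its own prediction exceeds a rollback threshold. The following is a minimal PyTorch sketch of that loop under several assumptions: greedy decoding with batch size 1, decoder-only Hugging Face-style models that return `.logits` (the paper's experiments actually use encoder-decoder models such as T5/mT5), and per-token cross-entropy under the large model as the rollback distance. The function name `bild_generate` and all default values are illustrative; the authors' released implementation is in the repository linked above.

```python
# Hedged sketch of the Big Little Decoder fallback/rollback loop (Algorithm 1).
# Assumptions: greedy decoding, batch size 1, causal-LM models exposing .logits,
# and cross-entropy under the large model as the rollback distance.
import torch
import torch.nn.functional as F


@torch.no_grad()
def bild_generate(small_model, large_model, input_ids,
                  fallback_threshold=0.6,   # paper sweeps [0.5, 0.9] for MT, [0.2, 0.6] for summarization
                  rollback_threshold=5.0,   # paper sweeps [1, 10] for MT, [2, 6] for summarization
                  max_small_run=10,         # paper caps the small model's run at 10 tokens
                  max_new_tokens=128,
                  eos_token_id=None):
    """Small model drafts tokens until its confidence falls below
    `fallback_threshold` (or `max_small_run` is reached); the large model then
    scores the draft in one pass, rolls back any token whose cross-entropy
    exceeds `rollback_threshold`, and emits the next token itself."""
    generated = input_ids                      # shape (1, seq_len)
    prompt_len = input_ids.shape[1]

    while generated.shape[1] - prompt_len < max_new_tokens:
        # --- small model drafts tokens until a fallback is triggered ---
        draft_start = generated.shape[1]
        for _ in range(max_small_run):
            logits = small_model(generated).logits[:, -1, :]
            probs = F.softmax(logits, dim=-1)
            confidence, next_token = probs.max(dim=-1)
            if confidence.item() < fallback_threshold:
                break                          # fallback: hand control to the large model
            generated = torch.cat([generated, next_token[:, None]], dim=-1)
            if eos_token_id is not None and next_token.item() == eos_token_id:
                return generated

        # --- large model verifies the whole draft in a single forward pass ---
        large_logits = large_model(generated).logits   # (1, seq_len, vocab)
        rollback_to = None
        for pos in range(draft_start, generated.shape[1]):
            token = generated[0, pos]
            log_probs = F.log_softmax(large_logits[0, pos - 1], dim=-1)
            if -log_probs[token].item() > rollback_threshold:
                rollback_to = pos              # first drafted token the large model rejects
                break

        if rollback_to is not None:
            generated = generated[:, :rollback_to]              # discard the rejected suffix
            next_token = large_logits[0, rollback_to - 1].argmax()
        else:
            next_token = large_logits[0, -1].argmax()           # fallback token from the large model

        generated = torch.cat([generated, next_token.view(1, 1)], dim=-1)
        if eos_token_id is not None and next_token.item() == eos_token_id:
            return generated

    return generated
```

The two threshold arguments correspond to the fallback and rollback thresholds swept in the experiment-setup row, and `max_small_run` mirrors the paper's cap of 10 consecutive small-model tokens to limit rollback cost; the large model runs at most once per outer iteration, which is where the reported speedup on a single T4 GPU comes from.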