Speculative Decoding with Big Little Decoder

Authors: Sehoon Kim, Karttikeya Mangalam, Suhong Moon, Jitendra Malik, Michael W. Mahoney, Amir Gholami, Kurt Keutzer

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To evaluate our framework across different tasks and models, we apply BiLD to various text generation scenarios encompassing machine translation on IWSLT 2017 De-En and WMT 2014 De-En, and summarization on XSUM and CNN/DailyMail. On an NVIDIA T4 GPU, our framework achieves a speedup of up to 2.12x with minimal generation quality degradation.
Researcher Affiliation | Academia | University of California, Berkeley; ICSI; LBNL
Pseudocode | Yes | Algorithm 1: Big Little Decoder (a hedged sketch of this decoding loop appears after the table)
Open Source Code | Yes | Our code is open-sourced: https://github.com/kssteven418/BigLittleDecoder
Open Datasets | Yes | To evaluate our framework across different tasks and models, we apply BiLD to various text generation scenarios encompassing machine translation on IWSLT 2017 De-En and WMT 2014 De-En, and summarization on XSUM and CNN/DailyMail.
Dataset Splits | Yes | Figure 2 plots the text generation quality on the validation dataset of each benchmark for different proportions of the large model's engagement.
Hardware Specification | Yes | On an NVIDIA T4 GPU, our framework achieves a speedup of up to 2.12x with minimal generation quality degradation. [...] All inference evaluations are conducted on a single NVIDIA T4 GPU of a GCP n1-standard-4 instance, using a batch size of 1, which is a common use case for online serving [51].
Software Dependencies | No | The paper states: 'Our framework is built on top of PyTorch [45] and the Hugging Face Transformers library [73] along with their pre-trained checkpoints.' However, it does not provide specific version numbers for these software components.
Experiment Setup | Yes | All the models are fine-tuned from the pre-trained checkpoints of the Hugging Face library [73] for 500k steps using a batch size of 16. We use the Adafactor optimizer [53] with a constant learning rate of {0.5, 1, 2, 5}e-4 for the small models and {0.5, 1}e-4 for the large models. [...] For the machine translation tasks, we use fallback thresholds in [0.5, 0.9] and rollback thresholds in [1, 10]. For the summarization tasks, we use fallback thresholds in [0.2, 0.6] and rollback thresholds in [2, 6]. We keep the maximum generation length of the small model at 10 to avoid high rollback costs.
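
The fallback and rollback thresholds quoted above correspond to Algorithm 1 (Big Little Decoder): the small model decodes greedily until its confidence drops below the fallback threshold (or it has run unchecked for too long), at which point the large model re-scores the unverified tokens, rolls back any token whose cross-entropy under the large model exceeds the rollback threshold, and emits the next token itself. Below is a minimal sketch of such a loop, assuming both models are callables that return next-token logits for every position of a prefix; the function name `bild_generate`, the default threshold values, and greedy argmax decoding are illustrative assumptions, not the authors' implementation.

```python
import torch

def bild_generate(small_model, large_model, input_ids, max_len=128,
                  fallback_threshold=0.6, rollback_threshold=5.0,
                  max_small_run=10):
    """Sketch of a BiLD-style fallback/rollback decoding loop (illustrative only).

    `small_model` and `large_model` are assumed to be callables mapping a
    [1, seq_len] tensor of token ids to [1, seq_len, vocab] next-token logits.
    """
    output = list(input_ids)
    unverified = 0  # small-model tokens not yet checked by the large model

    while len(output) < max_len:
        prefix = torch.tensor([output])
        small_probs = torch.softmax(small_model(prefix)[0, -1], dim=-1)
        top_prob, top_token = small_probs.max(dim=-1)

        # Fallback policy: hand over to the large model when the small model's
        # confidence is low, or when it has produced max_small_run tokens unchecked.
        if top_prob < fallback_threshold or unverified >= max_small_run:
            large_logits = large_model(prefix)  # single pass over the whole prefix
            rolled_back = False

            # Rollback policy: re-score each unverified token; if its cross-entropy
            # under the large model exceeds the threshold, discard it and everything after.
            for i in range(len(output) - unverified, len(output)):
                log_probs = torch.log_softmax(large_logits[0, i - 1], dim=-1)
                if -log_probs[output[i]] > rollback_threshold:
                    output = output[:i]
                    rolled_back = True
                    break

            if rolled_back:  # recompute logits for the truncated prefix
                large_logits = large_model(torch.tensor([output]))

            output.append(int(large_logits[0, -1].argmax()))  # large model emits next token
            unverified = 0
        else:
            output.append(int(top_token))  # accept the small model's token
            unverified += 1

    return output
```

Capping the small model's unchecked run (`max_small_run`, 10 in the quoted setup) bounds how many tokens a single rollback can discard, which is the rollback cost the paper limits by capping the small model's maximum generation length.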
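
For the fine-tuning setup quoted above (Adafactor with a constant learning rate, batch size 16, 500k steps), a hedged configuration sketch using the Hugging Face Transformers `Adafactor` class might look as follows; the checkpoint name "t5-small" and the particular learning-rate value are illustrative stand-ins, and the training loop itself is omitted.

```python
from transformers import Adafactor, AutoModelForSeq2SeqLM

# Illustrative checkpoint; the paper fine-tunes small and large seq2seq models
# from Hugging Face pre-trained checkpoints (exact model names not assumed here).
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Adafactor with a constant learning rate, matching the quoted setup.
optimizer = Adafactor(
    model.parameters(),
    lr=1e-4,              # e.g. one value from the {0.5, 1, 2, 5}e-4 grid for small models
    relative_step=False,  # disable Adafactor's internal schedule so lr stays constant
    scale_parameter=False,
    warmup_init=False,
)
```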