Speculative Decoding with Big Little Decoder

Authors: Sehoon Kim, Karttikeya Mangalam, Suhong Moon, Jitendra Malik, Michael W. Mahoney, Amir Gholami, Kurt Keutzer

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To evaluate our framework across different tasks and models, we apply BiLD to various text generation scenarios encompassing machine translation on IWSLT 2017 De-En and WMT 2014 De-En, and summarization on XSUM and CNN/DailyMail. On an NVIDIA T4 GPU, our framework achieves a speedup of up to 2.12x with minimal generation quality degradation.
Researcher Affiliation | Academia | University of California, Berkeley; ICSI; LBNL
Pseudocode | Yes | Algorithm 1: Big Little Decoder (a hedged sketch of this decoding loop appears after the table)
Open Source Code | Yes | Our code is open-sourced: https://github.com/kssteven418/BigLittleDecoder
Open Datasets | Yes | To evaluate our framework across different tasks and models, we apply BiLD to various text generation scenarios encompassing machine translation on IWSLT 2017 De-En and WMT 2014 De-En, and summarization on XSUM and CNN/DailyMail.
Dataset Splits | Yes | Figure 2 plots the text generation quality on the validation dataset of each benchmark for different proportions of the large model's engagement.
Hardware Specification | Yes | On an NVIDIA T4 GPU, our framework achieves a speedup of up to 2.12x with minimal generation quality degradation. [...] All inference evaluations are conducted on a single NVIDIA T4 GPU of a GCP n1-standard-4 instance, using a batch size of 1, which is a common use case for online serving [51].
Software Dependencies | No | The paper states: 'Our framework is built on top of PyTorch [45] and the Hugging Face Transformers library [73] along with their pre-trained checkpoints.' However, it does not provide specific version numbers for these software components.
Experiment Setup | Yes | All the models are fine-tuned from the pre-trained checkpoints of the Hugging Face library [73] for 500k steps using a batch size of 16. We use the Adafactor optimizer [53] with a constant learning rate of {0.5, 1, 2, 5}e-4 for the small models and {0.5, 1}e-4 for the large models. [...] For the machine translation tasks, we use fallback thresholds in [0.5, 0.9] and rollback thresholds in [1, 10]. For the summarization tasks, we use fallback thresholds in [0.2, 0.6] and rollback thresholds in [2, 6]. We keep the maximum generation length of the small model at 10 to avoid high rollback costs.
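
The fallback and rollback thresholds quoted above correspond to Algorithm 1 (Big Little Decoder): the small model decodes greedily until its confidence drops below the fallback threshold (or it has run unchecked for too long), at which point the large model re-scores the unverified tokens, rolls back any token whose cross-entropy under the large model exceeds the rollback threshold, and emits the next token itself. Below is a minimal sketch of such a loop, assuming both models are callables that return next-token logits for every position of a prefix; the function name `bild_generate`, the default threshold values, and greedy argmax decoding are illustrative assumptions, not the authors' implementation.

```python
import torch

def bild_generate(small_model, large_model, input_ids, max_len=128,
                  fallback_threshold=0.6, rollback_threshold=5.0,
                  max_small_run=10):
    """Sketch of a BiLD-style fallback/rollback decoding loop (illustrative only).

    `small_model` and `large_model` are assumed to be callables mapping a
    [1, seq_len] tensor of token ids to [1, seq_len, vocab] next-token logits.
    """
    output = list(input_ids)
    unverified = 0  # small-model tokens not yet checked by the large model

    while len(output) < max_len:
        prefix = torch.tensor([output])
        small_probs = torch.softmax(small_model(prefix)[0, -1], dim=-1)
        top_prob, top_token = small_probs.max(dim=-1)

        # Fallback policy: hand over to the large model when the small model's
        # confidence is low, or when it has produced max_small_run tokens unchecked.
        if top_prob < fallback_threshold or unverified >= max_small_run:
            large_logits = large_model(prefix)  # single pass over the whole prefix
            rolled_back = False

            # Rollback policy: re-score each unverified token; if its cross-entropy
            # under the large model exceeds the threshold, discard it and everything after.
            for i in range(len(output) - unverified, len(output)):
                log_probs = torch.log_softmax(large_logits[0, i - 1], dim=-1)
                if -log_probs[output[i]] > rollback_threshold:
                    output = output[:i]
                    rolled_back = True
                    break

            if rolled_back:  # recompute logits for the truncated prefix
                large_logits = large_model(torch.tensor([output]))

            output.append(int(large_logits[0, -1].argmax()))  # large model emits next token
            unverified = 0
        else:
            output.append(int(top_token))  # accept the small model's token
            unverified += 1

    return output
```

Capping the small model's unchecked run (`max_small_run`, 10 in the quoted setup) bounds how many tokens a single rollback can discard, which is the rollback cost the paper limits by capping the small model's maximum generation length.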
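
For the fine-tuning setup quoted above (Adafactor with a constant learning rate, batch size 16, 500k steps), a hedged configuration sketch using the Hugging Face Transformers `Adafactor` class might look as follows; the checkpoint name "t5-small" and the particular learning-rate value are illustrative stand-ins, and the training loop itself is omitted.

```python
from transformers import Adafactor, AutoModelForSeq2SeqLM

# Illustrative checkpoint; the paper fine-tunes small and large seq2seq models
# from Hugging Face pre-trained checkpoints (exact model names not assumed here).
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Adafactor with a constant learning rate, matching the quoted setup.
optimizer = Adafactor(
    model.parameters(),
    lr=1e-4,              # e.g. one value from the {0.5, 1, 2, 5}e-4 grid for small models
    relative_step=False,  # disable Adafactor's internal schedule so lr stays constant
    scale_parameter=False,
    warmup_init=False,
)
```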