Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Speculative Decoding with Big Little Decoder
Authors: Sehoon Kim, Karttikeya Mangalam, Suhong Moon, Jitendra Malik, Michael W. Mahoney, Amir Gholami, Kurt Keutzer
NeurIPS 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate our framework across different tasks and models, we apply BiLD to various text generation scenarios encompassing machine translation on IWSLT 2017 De-En and WMT 2014 De-En, and summarization on XSUM and CNN/Daily Mail. On an NVIDIA T4 GPU, our framework achieves a speedup of up to 2.12× speedup with minimal generation quality degradation. |
| Researcher Affiliation | Academia | ¹University of California, Berkeley ²ICSI ³LBNL |
| Pseudocode | Yes | Algorithm 1: Big Little Decoder |
| Open Source Code | Yes | Our code is open-sourced¹. ¹https://github.com/kssteven418/BigLittleDecoder |
| Open Datasets | Yes | To evaluate our framework across different tasks and models, we apply BiLD to various text generation scenarios encompassing machine translation on IWSLT 2017 De-En and WMT 2014 De-En, and summarization on XSUM and CNN/Daily Mail. |
| Dataset Splits | Yes | Figure 2 plots the text generation quality on the validation dataset of each benchmark for different proportions of the large model's engagement. |
| Hardware Specification | Yes | On an NVIDIA T4 GPU, our framework achieves a speedup of up to 2.12× speedup with minimal generation quality degradation. [...] All inference evaluations are conducted on a single NVIDIA T4 GPU of a GCP n1-standard-4 instance, using a batch size 1, which is a common use case for online serving [51]. |
| Software Dependencies | No | The paper states: 'Our framework is built on top of PyTorch [45] and the Hugging Face Transformers library [73] along with their pre-trained checkpoints.' However, it does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | All the models are fine-tuned from the pre-trained checkpoints of the Hugging Face library [73] for 500k steps using a batch size of 16. We use Adafactor optimizer [53] with constant learning rate of {0.5, 1, 2, 5}e-4 for the small models and {0.5, 1}e-4 for the large models. [...] For the machine translation tasks, we use fallback thresholds in [0.5, 0.9] and rollback thresholds in [1, 10]. For the summarization tasks, fallback thresholds in [0.2, 0.6] and rollback thresholds in [2, 6]. We keep the maximum generation length of the small model to 10 to avoid high rollback costs. |
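
The decoding hyperparameters quoted in the Experiment Setup row can be written down as a small configuration object, which may help when trying to reproduce the reported sweeps. The following is a minimal sketch, not the authors' code: the names `THRESHOLD_RANGES`, `DecodingConfig`, and `max_small_model_length` are hypothetical, and treating the quoted ranges as closed intervals is an assumption.

```python
from dataclasses import dataclass

# Threshold ranges quoted in the Experiment Setup row, read as closed
# intervals (an assumption). The dictionary layout and key names are
# hypothetical and do not come from the paper's repository.
THRESHOLD_RANGES = {
    "translation": {"fallback": (0.5, 0.9), "rollback": (1.0, 10.0)},
    "summarization": {"fallback": (0.2, 0.6), "rollback": (2.0, 6.0)},
}


@dataclass(frozen=True)
class DecodingConfig:
    """Hypothetical container for one decoding configuration."""

    task: str
    fallback_threshold: float
    rollback_threshold: float
    # The row reports a maximum small-model generation length of 10.
    max_small_model_length: int = 10

    def __post_init__(self) -> None:
        # Reject tasks for which the row reports no threshold ranges.
        if self.task not in THRESHOLD_RANGES:
            raise ValueError(f"unknown task: {self.task!r}")
        ranges = THRESHOLD_RANGES[self.task]
        values = {
            "fallback": self.fallback_threshold,
            "rollback": self.rollback_threshold,
        }
        # Check each threshold against the range reported for this task.
        for name, value in values.items():
            low, high = ranges[name]
            if not low <= value <= high:
                raise ValueError(
                    f"{name} threshold {value} is outside the reported "
                    f"range [{low}, {high}] for {self.task}"
                )


config = DecodingConfig(
    task="translation", fallback_threshold=0.6, rollback_threshold=2.0
)
print(config)
```

Constructing a `DecodingConfig` with a threshold outside the reported range for its task raises `ValueError`, which makes it harder to accidentally evaluate a setting the paper did not cover.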