Blockwise Parallel Decoding for Deep Autoregressive Models

Authors: Mitchell Stern, Noam Shazeer, Jakob Uszkoreit

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We verify our approach empirically through a series of experiments using state-of-the-art self-attention models for machine translation and image super-resolution, achieving iteration reductions of up to 2x over a baseline greedy decoder with no loss in quality, or up to 7x in exchange for a slight decrease in performance.
Researcher Affiliation | Collaboration | Mitchell Stern, University of California, Berkeley (mitchell@berkeley.edu); Noam Shazeer, Google Brain (noam@google.com); Jakob Uszkoreit, Google Brain (usz@google.com)
Pseudocode | Yes | We propose the following blockwise parallel decoding algorithm (illustrated in Figure 1), which is guaranteed to produce the same prediction ŷ that would be found under greedy decoding but uses as few as m/k steps. As before, we start with an empty prediction ŷ and set j = 0. Then we repeat the following three substeps until the termination condition is met: Predict: Get the block predictions ŷ_{j+i} = argmax_{y_{j+i}} p_i(y_{j+i} | ŷ_{≤j}, x) for i = 1, ..., k. Verify: Find the largest k̂ such that ŷ_{j+i} = argmax_{y_{j+i}} p_1(y_{j+i} | ŷ_{≤j+i−1}, x) for all 1 ≤ i ≤ k̂. Note that k̂ ≥ 1 by the definition of ŷ_{j+1}. Accept: Extend ŷ with ŷ_{j+1}, ..., ŷ_{j+k̂} and set j ← j + k̂. (A plain-Python sketch of this loop appears after the table.)
Open Source Code | Yes | Our code is publicly available in the open-source Tensor2Tensor library (Vaswani et al., 2018).
Open Datasets | Yes | For our machine translation experiments, we use the WMT 2014 English-German translation dataset.
Dataset Splits | Yes | We measure the BLEU score and the mean accepted block size k̂ on the development set under a variety of settings. Results are reported in Table 1.
Hardware Specification | Yes | Our baseline model is a Transformer trained for 1,000,000 steps on 8 P100 GPUs using the transformer_base hyperparameter set in Tensor2Tensor.
Software Dependencies | No | The paper states that it uses the open-source Tensor2Tensor framework but does not specify its version or the versions of other software dependencies.
Experiment Setup | Yes | Our baseline model is a Transformer trained for 1,000,000 steps on 8 P100 GPUs using the transformer_base hyperparameter set in Tensor2Tensor.
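
For reference, the following is a minimal Python sketch of the predict/verify/accept loop quoted in the Pseudocode row. It is an illustration under stated assumptions, not the authors' Tensor2Tensor implementation: the callables block_argmax and argmax_next are hypothetical stand-ins for the k prediction heads and the base model's greedy scorer, and the verify step is written as a per-position loop rather than the single batched model call used in the paper.

```python
def blockwise_parallel_decode(x, block_argmax, argmax_next, k, max_len, eos):
    """Greedy blockwise parallel decoding: predict / verify / accept.

    Assumed (hypothetical) helpers:
      block_argmax(x, prefix, k) -> list of k proposed tokens, where head i
          greedily predicts position len(prefix) + i from the prefix alone;
          head 1 is the base model itself.
      argmax_next(x, prefix)     -> the base model's greedy next token.
    """
    y = []  # accepted prediction, written as y-hat in the paper's notation
    while len(y) < max_len and (not y or y[-1] != eos):
        # Predict: propose the next k tokens in parallel from the prefix.
        proposals = block_argmax(x, y, k)

        # Verify: keep the longest prefix of the proposals that matches what
        # greedy decoding with the base model would produce position by
        # position. (The paper batches these checks into one model call.)
        accepted = []
        for token in proposals:
            if token != argmax_next(x, y + accepted):
                break
            accepted.append(token)
            if token == eos:
                break

        # Accept: extend the output. Because head 1 is the base model, the
        # first proposal always verifies, so at least one token is accepted
        # per iteration (k_hat >= 1) and the loop terminates.
        y.extend(accepted)
    return y
```

With k = 1 this reduces to ordinary greedy decoding; larger k performs more computation per step but can accept several tokens at once, which is where the reported reduction in decoding iterations comes from.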