Fast Decoding in Sequence Models Using Discrete Latent Variables
Authors: Lukasz Kaiser, Samy Bengio, Aurko Roy, Ashish Vaswani, Niki Parmar, Jakob Uszkoreit, Noam Shazeer
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we evaluate our model end-to-end on the task of neural machine translation, where it is an order of magnitude faster at decoding than comparable autoregressive models. |
| Researcher Affiliation | Industry | 1Google Brain, Mountain View, California, USA. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our implementation, together with hyper-parameters and everything needed to reproduce our results, is available as open-source [1]. [1] The code is available under redacted. |
| Open Datasets | Yes | We train the Latent Transformer with the base configuration to make it comparable to both the autoregressive baseline (Vaswani et al., 2017) and to the recent non-autoregressive NMT results (Gu et al., 2017)... BLEU scores (the higher the better) on the WMT English-German translation task on the newstest2014 test set. |
| Dataset Splits | Yes | BLEU scores (the higher the better) on the WMT English-German translation task on the newstest2014 test set. The acronym LT denotes the Latent Transformer from Section 3. Results reported for LT are from this work... Log-perplexities of autoencoder reconstructions on the development set (newstest2013) for different values of n/m and numbers of bits in latent variables (LT trained for 250K steps). |
| Hardware Specification | Yes | decoding is implemented in TensorFlow on a Nvidia GeForce GTX 1080. |
| Software Dependencies | No | We used around 33K subword units as vocabulary and implemented our model in TensorFlow (Abadi et al., 2015). |
| Experiment Setup | Yes | In this work we focused on the autoencoding functions and did not tune the Transformer: we used all the defaults from the baseline provided by the Transformer authors (6 layers, hidden size of 512 and filter size of 4096) and only varied parameters relevant to ae and ad, which we describe below. ... λ is a decay parameter which we set to 0.999 in our experiments. ... The optimal number of decompositions for our choice of latent vocabulary size log2 K = 14 and 16 was nd = 2... |
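
The Experiment Setup row quotes the key hyper-parameters: the unchanged Transformer-base defaults, the decay parameter λ = 0.999, and the latent decomposition settings. The Python sketch below merely collects those quoted values into one configuration object and illustrates a generic exponential-moving-average step with that decay; the key names and the helper function are hypothetical and do not correspond to the authors' released code.

```python
# Illustrative sketch only: the hyper-parameters quoted in the table, gathered
# into a single configuration dict. Key names are hypothetical; the authors'
# open-source implementation may organise these settings differently.
latent_transformer_config = {
    # Transformer "base" defaults the authors kept unchanged
    "num_layers": 6,
    "hidden_size": 512,
    "filter_size": 4096,
    # Settings varied for the autoencoding functions ae and ad
    "latent_bits": 16,        # log2(K); the paper reports 14 and 16 bits
    "num_decompositions": 2,  # n_d, reported as optimal for 14- and 16-bit latents
    "ema_decay": 0.999,       # λ, the decay parameter quoted above
    # From the Software Dependencies row
    "subword_vocab_size": 33000,  # approx. 33K subword units
}


def ema_update(old_value, new_value, decay=0.999):
    """Generic exponential-moving-average step with decay λ.

    The paper sets λ = 0.999; this helper only shows the standard EMA formula
    old ← λ·old + (1 − λ)·new, not the authors' exact update code.
    """
    return decay * old_value + (1.0 - decay) * new_value
```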