Fast Structured Decoding for Sequence Models

Authors: Zhiqing Sun, Zhuohan Li, Haoqing Wang, Di He, Zi Lin, Zhihong Deng

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments in machine translation show that while increasing little latency (8~14ms), our model could achieve significantly better translation performance than previous non-autoregressive models on different translation datasets. In particular, for the WMT14 En-De dataset, our model obtains a BLEU score of 26.80, which largely outperforms the previous non-autoregressive baselines and is only 0.61 lower in BLEU than purely autoregressive models.
Researcher Affiliation | Academia | Zhiqing Sun (Carnegie Mellon University), Zhuohan Li (University of California, Berkeley), Haoqing Wang, Di He, Zi Lin, Zhi-Hong Deng (Peking University)
Pseudocode | No | The paper describes algorithms and models in text and diagrams (Figure 1), but does not contain a structured pseudocode or algorithm block.
Open Source Code | Yes | The reproducible code can be found at https://github.com/Edward-Sun/structured-nart
Open Datasets | Yes | We use several widely adopted benchmark tasks to evaluate the effectiveness of our proposed models: IWSLT14 German-to-English translation (IWSLT14 De-En, https://wit3.fbk.eu/) and WMT14 English-to-German/German-to-English translation (WMT14 En-De/De-En, http://statmt.org/wmt14/translation-task.html).
Dataset Splits | Yes | For the WMT14 dataset, we use Newstest2014 as test data and Newstest2013 as validation data.
Hardware Specification | Yes | Models for WMT14/IWSLT14 tasks are trained on 4/1 NVIDIA P40 GPUs, respectively. [...] we evaluate the average per-sentence decoding latency on WMT14 En-De test sets with batch size 1 with a single NVIDIA Tesla P100 GPU for the Transformer model and the NART models to measure the speedup of our models.
Software Dependencies | No | The paper states 'We implement our models based on the open-sourced tensor2tensor library [23]' and 'We use Adam [30] optimizer' but does not provide specific version numbers for these software components.
Experiment Setup | Yes | For the WMT14 dataset, we use the default network architecture of the original base Transformer [1], which consists of a 6-layer encoder and 6-layer decoder. The size of hidden states d_model is set to 512. [...] For all datasets, we set the size of transition embedding d_t to 32 and the beam size k of beam approximation to 64. Hyperparameter λ is set to 0.5 to balance the scale of two loss components. [...] We use Adam [30] optimizer and employ label smoothing of value ϵ_ls = 0.1 [31] in all experiments.
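The hyperparameters, splits, and latency protocol quoted in the Experiment Setup, Dataset Splits, and Hardware Specification rows above can be gathered into a single configuration sketch. This is a minimal illustration under assumptions, not the authors' released tensor2tensor code: the names ExperimentConfig and measure_per_sentence_latency are hypothetical, only the numeric values come from the quotes, and all training/decoding internals are omitted.

from dataclasses import dataclass
import time

@dataclass
class ExperimentConfig:
    # WMT14: default base Transformer architecture (6-layer encoder/decoder, d_model = 512)
    encoder_layers: int = 6
    decoder_layers: int = 6
    d_model: int = 512
    # Structured-decoding settings reported for all datasets
    transition_embedding_dim: int = 32   # d_t
    beam_size: int = 64                  # k, beam approximation
    loss_balance_lambda: float = 0.5     # λ balancing the two loss components
    # Optimization
    optimizer: str = "adam"
    label_smoothing: float = 0.1         # ϵ_ls
    # WMT14 data splits
    validation_set: str = "newstest2013"
    test_set: str = "newstest2014"

def measure_per_sentence_latency(decode_fn, sentences):
    """Average per-sentence decoding latency with batch size 1.

    Mirrors the protocol quoted in the Hardware Specification row
    (WMT14 En-De test data, batch size 1, single GPU); decode_fn is a
    placeholder for an actual model's decoding call.
    """
    start = time.perf_counter()
    for sentence in sentences:
        decode_fn([sentence])  # batch size 1
    return (time.perf_counter() - start) / len(sentences)

The sketch deliberately leaves out details the rows above also pin down but that do not fit a flat configuration object, such as the GPU counts (4 P40s for WMT14, 1 for IWSLT14) and the tensor2tensor base implementation.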