Fast Structured Decoding for Sequence Models
Authors: Zhiqing Sun, Zhuohan Li, Haoqing Wang, Di He, Zi Lin, Zhihong Deng
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments in machine translation show that while increasing little latency (8-14ms), our model could achieve significantly better translation performance than previous non-autoregressive models on different translation datasets. In particular, for the WMT14 En-De dataset, our model obtains a BLEU score of 26.80, which largely outperforms the previous non-autoregressive baselines and is only 0.61 lower in BLEU than purely autoregressive models. |
| Researcher Affiliation | Academia | Zhiqing Sun (1), Zhuohan Li (2), Haoqing Wang (3), Di He (3), Zi Lin (3), Zhi-Hong Deng (3); 1 Carnegie Mellon University; 2 University of California, Berkeley; 3 Peking University |
| Pseudocode | No | The paper describes algorithms and models in text and diagrams (Figure 1), but does not contain a pseudocode or algorithm block (an illustrative decoding sketch is given after this table). |
| Open Source Code | Yes | The reproducible code can be found at https://github.com/Edward-Sun/structured-nart |
| Open Datasets | Yes | We use several widely adopted benchmark tasks to evaluate the effectiveness of our proposed models: IWSLT14 German-to-English translation (IWSLT14 De-En) and WMT14 English-to-German/German-to-English translation (WMT14 En-De/De-En). Dataset sources: https://wit3.fbk.eu/ and http://statmt.org/wmt14/translation-task.html |
| Dataset Splits | Yes | For the WMT14 dataset, we use Newstest2014 as test data and Newstest2013 as validation data. |
| Hardware Specification | Yes | Models for WMT14/IWSLT14 tasks are trained on 4/1 NVIDIA P40 GPUs, respectively. [...] we evaluate the average per-sentence decoding latency on WMT14 En-De test sets with batch size 1 with a single NVIDIA Tesla P100 GPU for the Transformer model and the NART models to measure the speedup of our models. |
| Software Dependencies | No | The paper states 'We implement our models based on the open-sourced tensor2tensor library [23]' and 'We use Adam [30] optimizer' but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | For the WMT14 dataset, we use the default network architecture of the original base Transformer [1], which consists of a 6-layer encoder and 6-layer decoder. The size of hidden states d_model is set to 512. [...] For all datasets, we set the size of transition embedding d_t to 32 and the beam size k of beam approximation to 64. Hyperparameter λ is set to 0.5 to balance the scale of two loss components. [...] We use Adam [30] optimizer and employ label smoothing of value ε_ls = 0.1 [31] in all experiments. (A sketch that reuses these settings follows this table.) |
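
Since the paper itself contains no pseudocode block, the following is a minimal, hypothetical NumPy sketch of the kind of decoding the paper describes: Viterbi decoding for a linear-chain CRF whose pairwise transitions are parameterized through low-dimensional transition embeddings, with each position pruned to its top-k tokens. Only the constants reuse the reported settings (d_t = 32, beam size k = 64); every name, shape, and the toy inputs are assumptions for illustration and are not taken from the released code at https://github.com/Edward-Sun/structured-nart.

```python
# Illustrative sketch only: linear-chain CRF decoding with a low-rank transition
# parameterization and top-k candidate pruning. Function and variable names are
# assumptions for exposition, not the authors' released implementation.
import numpy as np

D_T = 32      # transition embedding size d_t reported in the paper
BEAM_K = 64   # beam size k of the beam approximation reported in the paper

def viterbi_decode(emissions, E1, E2, W, k=BEAM_K):
    """Approximately maximize sum_t emissions[t, y_t] + transition(y_t, y_{t+1}),
    restricting each position to its top-k tokens by emission score."""
    T, V = emissions.shape
    k = min(k, V)
    # Per-position candidate sets (beam approximation).
    cand = np.argsort(-emissions, axis=1)[:, :k]             # (T, k)

    # Low-rank pairwise scores between consecutive candidate sets:
    # trans[a, b] = E1[cand[t-1, a]] @ W @ E2[cand[t, b]].
    left = E1[cand] @ W                                       # (T, k, d_t)

    score = emissions[0, cand[0]]                              # (k,)
    backptr = np.zeros((T, k), dtype=np.int64)
    for t in range(1, T):
        trans = left[t - 1] @ E2[cand[t]].T                    # (k, k)
        total = score[:, None] + trans                         # (k, k)
        backptr[t] = np.argmax(total, axis=0)
        score = total[backptr[t], np.arange(k)] + emissions[t, cand[t]]

    # Follow back-pointers to the best candidate path, then map back to token ids.
    best = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):
        best.append(int(backptr[t, best[-1]]))
    best.reverse()
    return [int(cand[t, b]) for t, b in enumerate(best)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, V = 6, 1000                                             # toy length and vocabulary
    emissions = rng.normal(size=(T, V))                        # stand-in decoder scores
    E1, E2 = rng.normal(size=(V, D_T)), rng.normal(size=(V, D_T))
    W = rng.normal(size=(D_T, D_T))
    print(viterbi_decode(emissions, E1, E2, W))
```

In the actual model the emission scores would come from the non-autoregressive Transformer decoder rather than random noise; the random tensors here only make the sketch runnable end to end.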