Non-Autoregressive Neural Machine Translation

Authors: Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K. Li, Richard Socher

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "5 EXPERIMENTS: We evaluate the proposed NAT on three widely used public machine translation corpora..." "Table 1: BLEU scores on official test sets..." "5.3 ABLATION STUDY"
Researcher Affiliation | Collaboration | Salesforce Research {james.bradbury,cxiong,rsocher}@salesforce.com; The University of Hong Kong {jiataogu, vli}@eee.hku.hk
Pseudocode | No | The paper provides architectural diagrams and mathematical formulations but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | "Implementation: We have open-sourced our PyTorch implementation of the NAT" (https://github.com/salesforce/nonauto-nmt)
Open Datasets | Yes | "We evaluate the proposed NAT on three widely used public machine translation corpora: IWSLT16 En-De (https://wit3.fbk.eu/), WMT14 En-De (http://www.statmt.org/wmt14/translation-task), and WMT16 En-Ro (http://www.statmt.org/wmt16/translation-task)."
Dataset Splits | Yes | "We use IWSLT which is smaller than the other two datasets as the development dataset for ablation experiments, and additionally train and test our primary models on both directions of both WMT datasets." "Table 1: BLEU scores on official test sets (newstest2014 for WMT En-De and newstest2016 for WMT En-Ro) or the development set for IWSLT."
Hardware Specification | Yes | "Latency is computed as the time to decode a single sentence without minibatching, averaged over the whole test set; decoding is implemented in PyTorch on a single NVIDIA Tesla P100." (See the latency-timing sketch after the table.)
Software Dependencies | No | The paper mentions using PyTorch for implementation but does not specify its version number or other software dependencies with their versions.
Experiment Setup | Yes | "Hyperparameters: For experiments on WMT datasets, we use the hyperparameter settings of the base Transformer model described in Vaswani et al. (2017), though without label smoothing. As IWSLT is a smaller corpus, and to reduce training time, we use a set of smaller hyperparameters (d_model = 287, d_hidden = 507, n_layer = 5, n_head = 2, and t_warmup = 746) for all experiments on that dataset. For fine-tuning we use λ = 0.25." (A hedged config sketch follows the table.)
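
The quoted hyperparameters translate into a small, explicit configuration. The following is a minimal sketch only: the class name `NATConfig` and the field names are illustrative and are not taken from the released salesforce/nonauto-nmt code; the WMT values are the standard base-Transformer settings from Vaswani et al. (2017), which the paper says it follows.

```python
# Minimal sketch of the reported training configurations as plain Python.
# Names are hypothetical; only the numeric values come from the quoted text
# (IWSLT) or from the base Transformer of Vaswani et al. (2017) (WMT).
from dataclasses import dataclass

@dataclass
class NATConfig:
    d_model: int            # model / embedding dimension
    d_hidden: int           # feed-forward hidden dimension
    n_layers: int           # encoder and decoder layers
    n_heads: int            # attention heads
    warmup_steps: int       # learning-rate warmup steps
    label_smoothing: float  # paper trains without label smoothing
    finetune_lambda: float  # fine-tuning weight (λ = 0.25 in the paper)

# Smaller IWSLT16 model, as quoted above (reduces training time).
iwslt_config = NATConfig(
    d_model=287, d_hidden=507, n_layers=5, n_heads=2,
    warmup_steps=746, label_smoothing=0.0, finetune_lambda=0.25,
)

# WMT models use the base Transformer settings of Vaswani et al. (2017),
# again without label smoothing.
wmt_config = NATConfig(
    d_model=512, d_hidden=2048, n_layers=6, n_heads=8,
    warmup_steps=4000, label_smoothing=0.0, finetune_lambda=0.25,
)
```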
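The latency protocol in the Hardware Specification row (single-sentence decoding, no minibatching, averaged over the whole test set) could be reproduced along the lines of the sketch below. It is an assumption-laden illustration, not the paper's measurement code: `model.translate` and `test_sentences` are hypothetical placeholders, and only the overall protocol matches the quote.

```python
# Sketch of per-sentence decoding latency, averaged over a test set,
# assuming a hypothetical `model.translate(sentence)` decode call.
import time
import torch

def average_decoding_latency(model, test_sentences, use_cuda=True):
    model.eval()
    total = 0.0
    with torch.no_grad():
        for src in test_sentences:           # one sentence at a time, no minibatching
            if use_cuda:
                torch.cuda.synchronize()     # ensure prior GPU work has finished
            start = time.perf_counter()
            _ = model.translate(src)         # hypothetical single-sentence decode
            if use_cuda:
                torch.cuda.synchronize()     # wait for decode kernels to complete
            total += time.perf_counter() - start
    return total / len(test_sentences)       # mean seconds per sentence
```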