Non-Autoregressive Neural Machine Translation
Authors: Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K. Li, Richard Socher
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5 (Experiments): We evaluate the proposed NAT on three widely used public machine translation corpora... Table 1: BLEU scores on official test sets... Section 5.3 (Ablation Study). |
| Researcher Affiliation | Collaboration | Salesforce Research ({james.bradbury,cxiong,rsocher}@salesforce.com); The University of Hong Kong ({jiataogu,vli}@eee.hku.hk) |
| Pseudocode | No | The paper provides architectural diagrams and mathematical formulations but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Implementation: We have open-sourced our PyTorch implementation of the NAT at https://github.com/salesforce/nonauto-nmt |
| Open Datasets | Yes | Dataset: We evaluate the proposed NAT on three widely used public machine translation corpora: IWSLT16 En-De (https://wit3.fbk.eu/), WMT14 En-De (http://www.statmt.org/wmt14/translation-task), and WMT16 En-Ro (http://www.statmt.org/wmt16/translation-task). |
| Dataset Splits | Yes | We use IWSLT, which is smaller than the other two datasets, as the development dataset for ablation experiments, and additionally train and test our primary models on both directions of both WMT datasets. Table 1: BLEU scores on official test sets (newstest2014 for WMT En-De and newstest2016 for WMT En-Ro) or the development set for IWSLT. |
| Hardware Specification | Yes | Latency is computed as the time to decode a single sentence without minibatching, averaged over the whole test set; decoding is implemented in PyTorch on a single NVIDIA Tesla P100. (A timing sketch based on this protocol follows the table.) |
| Software Dependencies | No | The paper mentions using PyTorch for the implementation but does not specify its version number or any other software dependencies with their versions. |
| Experiment Setup | Yes | Hyperparameters: For experiments on WMT datasets, we use the hyperparameter settings of the base Transformer model described in Vaswani et al. (2017), though without label smoothing. As IWSLT is a smaller corpus, and to reduce training time, we use a set of smaller hyperparameters (d_model = 287, d_hidden = 507, n_layer = 5, n_head = 2, and t_warmup = 746) for all experiments on that dataset. For fine-tuning we use λ = 0.25. (A configuration sketch collecting these values follows the table.) |
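
As a reference for the latency protocol quoted in the Hardware Specification row, here is a minimal PyTorch timing sketch. It assumes a hypothetical `decode_fn` that translates one pre-tokenized sentence at a time; it is not taken from the authors' released code.

```python
import time

import torch


def average_decoding_latency(model, test_sentences, decode_fn):
    """Average per-sentence decoding latency with batch size 1 (no minibatching),
    following the protocol described in the quoted passage."""
    model.eval()
    latencies = []
    with torch.no_grad():
        for src in test_sentences:              # one pre-tokenized source sentence at a time
            start = time.perf_counter()
            _ = decode_fn(model, src)           # hypothetical single-sentence decode call
            if torch.cuda.is_available():
                torch.cuda.synchronize()        # wait for asynchronous GPU kernels to finish
            latencies.append(time.perf_counter() - start)
    return sum(latencies) / len(latencies)      # mean latency over the whole test set
```

Synchronizing the GPU before stopping the timer matters here: without it, asynchronous CUDA kernels would make single-sentence timings look artificially short.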
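The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object, sketched below for illustration. The `TransformerConfig` class and its field names are assumptions, and the WMT values shown are the standard base-Transformer settings of Vaswani et al. (2017) rather than numbers quoted in this table.

```python
from dataclasses import dataclass


@dataclass
class TransformerConfig:          # hypothetical container; field names are assumptions
    d_model: int                  # model / embedding dimension
    d_hidden: int                 # feed-forward inner dimension
    n_layers: int                 # number of encoder and decoder layers
    n_heads: int                  # number of attention heads
    t_warmup: int                 # warm-up steps for the learning-rate schedule
    label_smoothing: float = 0.0  # the paper trains without label smoothing


# WMT En-De / En-Ro: base Transformer settings of Vaswani et al. (2017), no label smoothing.
wmt_config = TransformerConfig(d_model=512, d_hidden=2048, n_layers=6, n_heads=8,
                               t_warmup=4000, label_smoothing=0.0)

# IWSLT16 En-De: the smaller settings quoted in the Experiment Setup row.
iwslt_config = TransformerConfig(d_model=287, d_hidden=507, n_layers=5, n_heads=2,
                                 t_warmup=746)

FINE_TUNE_LAMBDA = 0.25           # fine-tuning weight lambda from the quoted setup
```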