Guiding Non-Autoregressive Neural Machine Translation Decoding with Reordering Information

Authors: Qiu Ran, Yankai Lin, Peng Li, Jie Zhou

AAAI 2021, pp. 13727-13735 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experimental results on various widely-used datasets show that our proposed model achieves better performance compared to most existing NAT models, and even achieves comparable translation quality as autoregressive translation models with a significant speedup."
Researcher Affiliation | Industry | "Qiu Ran, Yankai Lin, Peng Li, Jie Zhou, Pattern Recognition Center, WeChat AI, Tencent Inc., China {soulcaptran,yankailin,patrickpli,withtomzhou}@tencent.com"
Pseudocode | No | The paper describes the method using prose and a diagram, but it does not include pseudocode or an algorithm block.
Open Source Code | Yes | "The source codes are available at https://github.com/ranqiu92/ReorderNAT."
Open Datasets | Yes | "The main experiments are conducted on three widely-used machine translation tasks: WMT14 En-De (4.5M pairs), WMT16 En-Ro (610k pairs) and IWSLT16 En-De (196k pairs). ... We use the preprocessed corpus provided by Lee, Mansimov, and Cho (2018) at https://github.com/nyu-dl/dl4mt-nonauto/tree/multigpu. ... The training set consists of 1.25M sentence pairs extracted from the LDC corpora."
Dataset Splits | Yes | "For WMT14 En-De task, we take newstest-2013 and newstest-2014 as validation and test sets respectively. For WMT16 En-Ro task, we employ newsdev-2016 and newstest-2016 as validation and test sets respectively. For IWSLT16 En-De task, we use test2013 for validation. ... We use NIST 2002 (MT02) as validation set, and NIST 2003 (MT03), 2004 (MT04), 2005 (MT05) as test sets."
Hardware Specification | Yes | "We measure the model inference speedup on the validation set of IWSLT16 En-De task with a NVIDIA P40 GPU and set batch size to 1." (See the latency-measurement sketch after this table.)
Software Dependencies | No | The paper mentions using the 'fast_align' tool but does not provide specific version numbers for it or any other software dependencies.
Experiment Setup | Yes | "For IWSLT16 En-De, we use a 5-layer Transformer model (d_model = 278, d_hidden = 507, n_head = 2, p_dropout = 0.1) and anneal the learning rate linearly (from 3 × 10^-4 to 10^-5) as in (Lee, Mansimov, and Cho 2018). For WMT14 En-De, WMT16 En-Ro and Chinese-English translation, we use a 6-layer Transformer model (d_model = 512, d_hidden = 512, n_head = 8, p_dropout = 0.1) and adopt the warm-up learning rate schedule (Vaswani et al. 2017) with t_warmup = 4000. For the GRU reordering module, we set it to have the same hidden size with the Transformer model in each dataset. We employ label smoothing of value ε_ls = 0.15 and utilize the sequence-level knowledge distillation (Kim and Rush 2016). We also set T in Eq. 10 to 0.2 according to a grid search on the validation set. We set the beam size to 4 in the experiments." (See the configuration sketch after this table.)
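
The Experiment Setup row quotes all Transformer hyperparameters and both learning-rate schedules. Below is a minimal sketch of how those values could be collected, assuming a PyTorch-style training script; the dictionary keys, helper names, and the `total_steps` argument are illustrative and are not taken from the authors' released code.

```python
# Hyperparameters quoted in the Experiment Setup row above.
# Names are illustrative; they do not mirror the ReorderNAT codebase.
CONFIGS = {
    "iwslt16_en_de": dict(layers=5, d_model=278, d_hidden=507, n_head=2, p_dropout=0.1),
    "wmt14_en_de":   dict(layers=6, d_model=512, d_hidden=512, n_head=8, p_dropout=0.1),
}

LABEL_SMOOTHING = 0.15      # epsilon_ls
GUIDING_TEMPERATURE = 0.2   # T in Eq. 10, picked by grid search on the validation set
BEAM_SIZE = 4

def warmup_lr(step, d_model=512, t_warmup=4000):
    """Inverse-sqrt warm-up schedule of Vaswani et al. (2017), quoted for the WMT runs."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * t_warmup ** -1.5)

def linear_anneal_lr(step, total_steps, lr_start=3e-4, lr_end=1e-5):
    """Linear anneal from 3e-4 to 1e-5, quoted for IWSLT16 En-De.
    total_steps is not stated in the quote and must be supplied by the caller."""
    frac = min(step / total_steps, 1.0)
    return lr_start + frac * (lr_end - lr_start)
```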
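
The Hardware Specification row states that speedup is measured at batch size 1 on a single GPU. A minimal latency-measurement sketch under that protocol might look like the following; it assumes a CUDA device and a model exposing a hypothetical single-sentence `translate(sentence)` method, which is not the authors' actual API.

```python
import time
import torch

def measure_latency(model, sentences, device="cuda"):
    """Average per-sentence wall-clock decoding time at batch size 1 (assumes a CUDA device)."""
    model.eval().to(device)
    with torch.no_grad():
        # Warm up so one-time CUDA initialization does not skew the timing.
        for sentence in sentences[:5]:
            model.translate(sentence)  # hypothetical single-sentence decode call
        torch.cuda.synchronize()
        start = time.perf_counter()
        for sentence in sentences:
            model.translate(sentence)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    return elapsed / len(sentences)
```

Speedup is then the ratio of the autoregressive baseline's average latency to the NAT model's average latency on the same validation sentences.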