Non-Autoregressive Machine Translation with Auxiliary Regularization

Authors: Yiren Wang, Fei Tian, Di He, Tao Qin, ChengXiang Zhai, Tie-Yan Liu (pp. 5377–5384)

AAAI 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments conducted on several benchmark datasets show that both regularization strategies are effective and can alleviate the issues of repeated translations and incomplete translations in NAT models. The accuracy of NAT models is therefore improved significantly over the state-of-the-art NAT models with even better efficiency for inference.
Researcher Affiliation | Collaboration | (1) University of Illinois at Urbana-Champaign, Urbana, IL, USA; (2) Microsoft Research, Beijing, China; (3) Key Laboratory of Machine Perception, MOE, School of EECS, Peking University, Beijing, China
Pseudocode | No | The paper does not include any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper mentions using "fairseq recipes" (https://github.com/pytorch/fairseq) and the "official TensorFlow implementation of Transformer" (https://github.com/tensorflow/tensor2tensor), which are third-party frameworks, but does not state that its own implementation of the proposed method is open-source or publicly available.
Open Datasets | Yes | We use several widely adopted benchmark datasets to evaluate the effectiveness of our proposed method: the IWSLT14 German to English translation (IWSLT14 De-En), the IWSLT16 English to German (IWSLT16 En-De), and the WMT14 English to German (WMT14 En-De) and German to English (WMT14 De-En), which share the same dataset.
Dataset Splits | Yes | For IWSLT16 En-De, we follow the dataset construction in (Gu et al. 2018; Lee, Mansimov, and Cho 2018) with roughly 195k/1k/1k parallel sentences as training/dev/test sets respectively. We use the fairseq recipes for preprocessing data and creating dataset splits for the remaining datasets. The IWSLT14 De-En contains roughly 153k/7k/7k parallel sentences as training/dev/test sets. The WMT14 En-De/De-En dataset is much larger, with 4.5M parallel training pairs. We use Newstest2014 as the test set and Newstest2013 as the dev set.
Hardware Specification | Yes | We run the training procedure on 8/1 Nvidia M40 GPUs for the WMT and IWSLT datasets respectively. Distillation and inference are run on 1 GPU. ... We test latency on 1 NVIDIA Tesla P100 to keep in line with previous works (Gu et al. 2018).
Software Dependencies | No | Our models are implemented based on the official TensorFlow implementation of Transformer. For sequence-level distillation, we set beam size to be 4. The paper mentions software such as TensorFlow, PyTorch's fairseq, and the Moses multi-bleu script, but does not specify version numbers.
Experiment Setup | Yes | We train all models using Adam, following the optimizer settings and learning rate schedule in Transformer (Vaswani et al. 2017). ... For our model NAT-REG, we determine the trade-off parameters, i.e., α and β in Eqn. 5, by the BLEU on the IWSLT14 De-En dev set, and use the same values for all other datasets. The optimal values are α = 2 and β = 0.5. ... For the WMT datasets, we use the hyperparameter settings of the base Transformer model in (Vaswani et al. 2017). For IWSLT14 De-En, we use the small Transformer setting with a 5-layer encoder and 5-layer decoder (the size of hidden states and embeddings is 256, and the number of attention heads is 4). For IWSLT16 En-De, we use a slightly different version of the small setting with 5 layers from (Gu et al. 2018), where the size of hidden states and embeddings is 278 and the number of attention heads is 2.
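
The Experiment Setup row reports trade-off parameters α = 2 and β = 0.5 for the two auxiliary regularizers in Eqn. 5, together with the Transformer configuration used for each dataset. Since Eqn. 5 itself is not reproduced in this report, the sketch below assumes a plain weighted sum of the NAT negative log-likelihood and the two regularizers; the names `nat_reg_loss`, `repetition_reg`, and `completeness_reg`, and the mapping of α/β to the repetition/completeness terms, are illustrative assumptions rather than the paper's notation.

```python
import torch

# Trade-off parameters quoted in the Experiment Setup row
# (tuned on the IWSLT14 De-En dev set and reused for all other datasets).
ALPHA = 2.0   # assumed weight on the repeated-translation regularizer
BETA = 0.5    # assumed weight on the incomplete-translation regularizer

# Model configurations quoted in the Experiment Setup row.
MODEL_CONFIGS = {
    "wmt14_en_de":   {"preset": "transformer_base"},                 # Vaswani et al. 2017
    "iwslt14_de_en": {"layers": 5, "hidden_size": 256, "heads": 4},
    "iwslt16_en_de": {"layers": 5, "hidden_size": 278, "heads": 2},  # Gu et al. 2018 variant
}

def nat_reg_loss(nll_loss: torch.Tensor,
                 repetition_reg: torch.Tensor,
                 completeness_reg: torch.Tensor,
                 alpha: float = ALPHA,
                 beta: float = BETA) -> torch.Tensor:
    """Weighted sum of the NAT translation loss and the two auxiliary regularizers
    (an assumed reading of Eqn. 5, which is not reproduced in this report)."""
    return nll_loss + alpha * repetition_reg + beta * completeness_reg
```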
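The same row states that all models are trained with Adam following the optimizer settings and learning-rate schedule of Transformer (Vaswani et al. 2017). A minimal sketch of that inverse-square-root warmup schedule is given below; the `warmup_steps = 4000` and `d_model = 512` defaults come from the original Transformer base configuration and are assumptions, since this report does not state the exact values used (the IWSLT models use hidden sizes 256 and 278).

```python
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Inverse-square-root schedule from Vaswani et al. 2017:
    lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Example: the peak learning rate is reached at the end of warmup.
print(transformer_lr(4000))  # ~0.0007 for the base configuration
```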
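For quick reference, the Dataset Splits row condenses to the configuration-style summary below; the figures are the approximate sentence counts quoted in that row, and the dictionary layout is purely illustrative (not taken from the paper or from fairseq).

```python
# Approximate split sizes quoted in the Dataset Splits row.
DATASET_SPLITS = {
    "iwslt16_en_de": {"train": 195_000, "dev": 1_000, "test": 1_000},
    "iwslt14_de_en": {"train": 153_000, "dev": 7_000, "test": 7_000},
    "wmt14_en_de":   {"train": 4_500_000, "dev": "newstest2013", "test": "newstest2014"},
}
```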