Minimizing the Bag-of-Ngrams Difference for Non-Autoregressive Neural Machine Translation

Authors: Chenze Shao, Jinchao Zhang, Yang Feng, Fandong Meng, Jie Zhou

AAAI 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the proposed method on three translation tasks (IWSLT16 En-De, WMT14 En-De, WMT16 En-Ro). Experimental results show that the fine-tuning method achieves large improvements over the pre-trained NAT baseline, and the joint training method further brings considerable improvements over the fine-tuning method, outperforming the NAT baseline by about 5.0 BLEU on WMT14 En-De and about 2.5 BLEU on WMT16 En-Ro.
Researcher Affiliation | Collaboration | Chenze Shao (1,2), Jinchao Zhang (3), Yang Feng (1,2), Fandong Meng (3), Jie Zhou (3). 1: Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS); 2: University of Chinese Academy of Sciences; 3: Pattern Recognition Center, WeChat AI, Tencent Inc, China.
Pseudocode | Yes | Algorithm 1 (BoN-L1). Input: model parameters θ, input sentence X, reference sentence Ŷ, prediction length T, n. Output: BoN precision BoN-p. (A hedged sketch of this computation follows the table.)
Open Source Code | Yes | Reproducible code: https://github.com/ictnlp/BoN-NAT.
Open Datasets | Yes | We use several widely adopted benchmark datasets to evaluate the effectiveness of our proposed method: IWSLT16 En-De (196k pairs), WMT14 En-De (4.5M pairs) and WMT16 En-Ro (610k pairs). For WMT14 En-De, we employ newstest-2013 and newstest-2014 as development and test sets. For WMT16 En-Ro, we take newsdev-2016 and newstest-2016 as development and test sets. For IWSLT16 En-De, we use test2013 for validation. We use the preprocessed datasets released by Lee, Mansimov, and Cho (2018), where all sentences are tokenized and segmented into subword units (Sennrich, Haddow, and Birch 2016).
Dataset Splits | Yes | For WMT14 En-De, we employ newstest-2013 and newstest-2014 as development and test sets. For WMT16 En-Ro, we take newsdev-2016 and newstest-2016 as development and test sets. For IWSLT16 En-De, we use test2013 for validation.
Hardware Specification | Yes | Training and decoding speeds are measured on a single GeForce GTX TITAN X.
Software Dependencies | No | The paper mentions using Adam for optimization but does not provide version numbers for software dependencies such as the programming language, libraries, or frameworks.
Experiment Setup | Yes | For IWSLT16 En-De, we use the small Transformer (d_model=278, d_hidden=507, n_layer=5, n_head=2, p_dropout=0.1, t_warmup=746). For experiments on WMT datasets, we use the base Transformer (Vaswani et al. 2017) (d_model=512, d_hidden=512, n_layer=6, n_head=8, p_dropout=0.1, t_warmup=16000). We use Adam (Kingma and Ba 2014) for optimization. In the main experiment, the hyper-parameter α that combines the BoN objective with the cross-entropy loss is set to 0.1. We set n=2, i.e., we use the bag-of-2grams objective to train the model. (A hedged configuration sketch of these settings follows the table.)
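
For reference, here is a minimal, unoptimized sketch of the bag-of-n-grams precision described in the Pseudocode row, assuming a NAT model that outputs independent per-position token distributions. The function name bon_precision, the argument names, and the min-clipped soft-count formulation are our reading of the algorithm, not code taken from the released BoN-NAT repository.

```python
import torch

def bon_precision(probs: torch.Tensor, ref: list, n: int = 2) -> torch.Tensor:
    """Soft BoN precision of a NAT output distribution against a reference.

    probs: (T, V) per-position token probabilities (T = prediction length).
    ref:   list of reference token ids.
    Returns a differentiable scalar in [0, 1].
    """
    T = probs.size(0)

    # Bag of n-grams of the reference: n-gram tuple -> count.
    ref_bon = {}
    for i in range(len(ref) - n + 1):
        gram = tuple(ref[i:i + n])
        ref_bon[gram] = ref_bon.get(gram, 0) + 1

    # Expected (soft) count of each reference n-gram under the model,
    # clipped by the reference count and accumulated as the match count.
    match = probs.new_zeros(())
    for gram, ref_count in ref_bon.items():
        expected = probs.new_zeros(())
        for t in range(T - n + 1):
            # Probability that positions t .. t+n-1 emit this n-gram,
            # under the NAT assumption of conditionally independent positions.
            p = probs.new_ones(())
            for i, tok in enumerate(gram):
                p = p * probs[t + i, tok]
            expected = expected + p
        match = match + torch.clamp(expected, max=float(ref_count))

    # Normalize by the number of n-grams in the prediction.
    return match / (T - n + 1)
```

This brute-force version loops over positions in Python for clarity; the paper's Algorithm 1 and the released code presumably express the same computation with batched tensor operations.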
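
The hyperparameters quoted in the Experiment Setup row can be collected into a configuration sketch like the one below. The dictionary keys and the joint_loss helper are illustrative assumptions rather than option names from the released code; in particular, the exact way α combines the BoN objective with the cross-entropy loss is not spelled out in the row above, so the weighting shown is only one plausible reading.

```python
# Illustrative configuration sketch assembled from the Experiment Setup row.
# Key names (d_model, warmup_steps, ...) are descriptive, not taken from the
# BoN-NAT repository; check the released code for the actual option names.
CONFIGS = {
    "iwslt16_en_de": {   # small Transformer
        "d_model": 278, "d_hidden": 507, "n_layers": 5,
        "n_heads": 2, "dropout": 0.1, "warmup_steps": 746,
    },
    "wmt14_en_de": {     # base Transformer (Vaswani et al. 2017)
        "d_model": 512, "d_hidden": 512, "n_layers": 6,
        "n_heads": 8, "dropout": 0.1, "warmup_steps": 16000,
    },
    "wmt16_en_ro": {     # same base configuration as WMT14 En-De
        "d_model": 512, "d_hidden": 512, "n_layers": 6,
        "n_heads": 8, "dropout": 0.1, "warmup_steps": 16000,
    },
}

TRAINING = {
    "optimizer": "Adam",   # Kingma and Ba (2014)
    "alpha": 0.1,          # weight combining the BoN objective with cross-entropy
    "ngram_n": 2,          # bag-of-2grams objective
}

def joint_loss(loss_ce, loss_bon, alpha=TRAINING["alpha"]):
    """Assumed form of the joint objective: a weighted combination of the
    cross-entropy loss and the BoN loss. The paper's exact weighting scheme
    may differ; this only illustrates one reading of 'the hyper-parameter
    alpha that combines the BoN objective with the cross-entropy loss'."""
    return loss_ce + alpha * loss_bon
```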