Minimizing the Bag-of-Ngrams Difference for Non-Autoregressive Neural Machine Translation
Authors: Chenze Shao, Jinchao Zhang, Yang Feng, Fandong Meng, Jie Zhou
AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the proposed method on three translation tasks (IWSLT16 En-De, WMT14 En-De, WMT16 En-Ro). Experimental results show that the fine-tuning method achieves large improvements over the pre-trained NAT baseline, and the joint training method further brings considerable improvements over the fine-tuning method, which outperforms the NAT baseline by about 5.0 BLEU scores on WMT14 En-De and about 2.5 BLEU scores on WMT16 En-Ro. |
| Researcher Affiliation | Collaboration | Chenze Shao,1,2 Jinchao Zhang,3 Yang Feng,1,2 Fandong Meng,3 Jie Zhou3 1Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS) 2University of Chinese Academy of Sciences 3Pattern Recognition Center, WeChat AI, Tencent Inc, China |
| Pseudocode | Yes | Algorithm 1 BoN-L1. Input: model parameters θ, input sentence X, reference sentence Ŷ, prediction length T, n. Output: BoN precision BoN-p. (A hedged Python sketch of this computation is given below the table.) |
| Open Source Code | Yes | Reproducible code: https://github.com/ictnlp/BoN-NAT. |
| Open Datasets | Yes | Datasets: We use several widely adopted benchmark datasets to evaluate the effectiveness of our proposed method: IWSLT16 En-De (196k pairs), WMT14 En-De (4.5M pairs) and WMT16 En-Ro (610k pairs). For WMT14 En-De, we employ newstest-2013 and newstest-2014 as development and test sets. For WMT16 En-Ro, we take newsdev-2016 and newstest-2016 as development and test sets. For IWSLT16 En-De, we use test2013 for validation. We use the preprocessed datasets released by Lee, Mansimov, and Cho (2018), where all sentences are tokenized and segmented into subword units (Sennrich, Haddow, and Birch 2016). |
| Dataset Splits | Yes | For WMT14 En-De, we employ newstest-2013 and newstest-2014 as development and test sets. For WMT16 En-Ro, we take newsdev-2016 and newstest-2016 as development and test sets. For IWSLT16 En-De, we use test2013 for validation. |
| Hardware Specification | Yes | The training and decoding speed are measured on a single GeForce GTX TITAN X. |
| Software Dependencies | No | The paper mentions using "Adam" for optimization, but does not provide specific version numbers for software dependencies like programming languages, libraries, or frameworks. |
| Experiment Setup | Yes | For IWSLT16 En-De, we use the small Transformer (d_model=278, d_hidden=507, n_layer=5, n_head=2, p_dropout=0.1, t_warmup=746). For experiments on WMT datasets, we use the base Transformer (Vaswani et al. 2017) (d_model=512, d_hidden=512, n_layer=6, n_head=8, p_dropout=0.1, t_warmup=16000). We use Adam (Kingma and Ba 2014) for optimization. In the main experiment, the hyper-parameter α that combines the BoN objective and the cross-entropy loss is set to 0.1. We set n=2, i.e., we use the bag-of-2grams objective to train the model. (An illustrative sketch of this combined loss follows the table.) |
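
For reference, below is a minimal Python/PyTorch sketch of the bag-of-n-grams precision described in the Pseudocode row (Algorithm 1, BoN-L1). It assumes the NAT decoder exposes per-position word distributions `probs` of shape [T, V]; the function name `bon_precision` and the variable names are illustrative and are not taken from the released BoN-NAT code. The sketch reflects the BoN-L1 decomposition in which only n-grams that actually occur in the reference need to be matched against the model's expected n-gram counts.

```python
import torch
from collections import Counter

def bon_precision(probs, ref, n=2):
    """Bag-of-n-grams precision of an NAT output against a reference.

    probs: FloatTensor [T, V], per-position word distributions from the decoder.
    ref:   list[int], reference token ids.
    """
    T = probs.size(0)
    # Count every n-gram that occurs in the reference.
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    match = probs.new_zeros(())
    for gram, ref_count in ref_ngrams.items():
        # Expected count of this n-gram under the model:
        #   BoN(g) = sum_t  prod_i  probs[t + i, g_i]
        expected = probs.new_ones(T - n + 1)
        for i, tok in enumerate(gram):
            expected = expected * probs[i:T - n + 1 + i, tok]
        # Clip by the reference count, as in the min(., .) term of BoN-L1.
        match = match + expected.sum().clamp(max=float(ref_count))
    # The model's bag has total mass T - n + 1, so this is the BoN precision.
    return match / (T - n + 1)

# Toy usage: T=4 predicted positions, vocabulary of size 6, uniform distributions.
probs = torch.full((4, 6), 1.0 / 6)
print(bon_precision(probs, ref=[1, 2, 2, 3], n=2))
```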
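
The Experiment Setup row states that the BoN objective and the cross-entropy loss are combined with α = 0.1. The snippet below is a hedged sketch of one plausible joint loss, reusing `bon_precision` from the sketch above; whether α weights the BoN term or the cross-entropy term, and how the BoN loss is normalized, are assumptions to be checked against the paper and the released code.

```python
import torch.nn.functional as F

def joint_loss(logits, ref, alpha=0.1, n=2):
    """Hypothetical combination of BoN loss and cross-entropy (not necessarily
    the paper's exact formulation): L = alpha * L_BoN + (1 - alpha) * L_CE."""
    probs = logits.softmax(dim=-1)                        # [T, V] per-position distributions
    ce = F.cross_entropy(logits, ref)                     # token-level CE (assumes T == len(ref))
    bon = 1.0 - bon_precision(probs, ref.tolist(), n=n)   # BoN loss as 1 - BoN precision (assumption)
    return alpha * bon + (1.0 - alpha) * ce
```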