Deep Fusing Pre-trained Models into Neural Machine Translation

Authors: Rongxiang Weng, Heng Yu, Weihua Luo, Min Zhang (pp. 11468-11476)

AAAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that our approach achieves considerable improvements on WMT14 En2De, WMT14 En2Fr, and WMT16 Ro2En translation benchmarks and outperforms previous work in both autoregressive and non-autoregressive NMT models.
Researcher Affiliation | Collaboration | Rongxiang Weng (1,2), Heng Yu (2), Weihua Luo (2), Min Zhang (1); (1) School of Computer Science and Technology, Soochow University, Suzhou, China; (2) Machine Intelligence Technology Lab, Alibaba Group, Hangzhou, China
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to its own source code, either through a specific repository link or an explicit code release statement.
Open Datasets | Yes | We evaluate our approach on three WMT translation tasks, including WMT14 English to German (En-De), WMT14 English to French (En-Fr) and WMT16 Romanian to English (Ro-En). Here, the En-De is the most widely used benchmark in machine translation. The En-Fr has the most training data of any public dataset. And the Ro-En could be treated as a low-resource language pair. Following previous work (Vaswani et al. 2017; Caswell, Chelba, and Grangier 2019), on the En-De task, the training set has about 4.5M sentence pairs. We use newstest2013 as validation set which has 3000 sentence pairs, and newstest2014 as test set which has 3003 sentence pairs. On the En-Fr task, our training set has about 36M sentence pairs. We use newstest2013 as validation set which has 3000 sentence pairs, and newstest2014 as test set which has 3003 sentence pairs. On the Ro-En task, our training set has about 0.6M sentence pairs. We use newstest2015 as validation set which has 2000 sentence pairs, and newstest2016 as test set which has 2000 sentence pairs.
Dataset Splits | Yes | We use newstest2013 as validation set which has 3000 sentence pairs, and newstest2014 as test set which has 3003 sentence pairs. On the En-Fr task, our training set has about 36M sentence pairs. We use newstest2013 as validation set which has 3000 sentence pairs, and newstest2014 as test set which has 3003 sentence pairs. On the Ro-En task, our training set has about 0.6M sentence pairs. We use newstest2015 as validation set which has 2000 sentence pairs, and newstest2016 as test set which has 2000 sentence pairs. (These splits are summarized in the sketch below the table.)
Hardware Specification | Yes | All experiments are conducted on 8 V100 GPUs, and we accumulate the gradient 4 iterations in En-De and En-Fr tasks.
Software Dependencies | No | The paper mentions software like "tensor2tensor", "cased multilingual BERT", "Adam", and "multi-bleu.perl" but does not provide specific version numbers for these software components used in their implementation.
Experiment Setup | Yes | We use label smoothing with the value 0.1 and dropout with the rate of 0.1. Adam (Kingma and Ba 2014) is used to update parameters, and the learning rate is set as 0.0001. The batch size is set as 64 and the max sentence length is limited to 80. All experiments are conducted on 8 V100 GPUs, and we accumulate the gradient 4 iterations in En-De and En-Fr tasks. After the training stage, we use beam search as the decoding algorithm, and the beam size is set as 4. (These settings are illustrated in the training sketch below the table.)
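
To make the reported splits easier to scan, here is a minimal Python sketch (not from the paper) that records the approximate training sizes and the newstest validation/test sets quoted in the Open Datasets and Dataset Splits rows; the dictionary layout and helper function are purely illustrative.

```python
# Hypothetical summary of the train/validation/test splits quoted above.
# Sizes are the approximate figures reported in the paper; names and structure are illustrative.
WMT_SPLITS = {
    "WMT14 En-De": {"train": 4_500_000, "valid": ("newstest2013", 3000), "test": ("newstest2014", 3003)},
    "WMT14 En-Fr": {"train": 36_000_000, "valid": ("newstest2013", 3000), "test": ("newstest2014", 3003)},
    "WMT16 Ro-En": {"train": 600_000, "valid": ("newstest2015", 2000), "test": ("newstest2016", 2000)},
}

def describe(task: str) -> str:
    """Render one task's reported splits as a single line."""
    s = WMT_SPLITS[task]
    return (f"{task}: ~{s['train']:,} training pairs, "
            f"valid {s['valid'][0]} ({s['valid'][1]} pairs), "
            f"test {s['test'][0]} ({s['test'][1]} pairs)")

for task in WMT_SPLITS:
    print(describe(task))
```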
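
The hyperparameters quoted in the Experiment Setup and Hardware rows can be wired together roughly as in the sketch below. This is a generic PyTorch illustration under assumed defaults (a stock nn.Transformer, a hypothetical 32k vocabulary and 512-dimensional model), not the authors' tensor2tensor implementation, and the pre-trained-model fusion the paper actually proposes is not reproduced; only the numeric settings come from the quotes above.

```python
import torch
import torch.nn as nn

# Settings reported in the paper (Experiment Setup / Hardware rows above).
LEARNING_RATE = 1e-4      # Adam learning rate
LABEL_SMOOTHING = 0.1
DROPOUT = 0.1
BATCH_SIZE = 64
MAX_SENT_LEN = 80         # max sentence length
ACCUM_STEPS = 4           # gradient accumulation iterations (En-De / En-Fr)
BEAM_SIZE = 4             # beam size at decoding time (decoding not shown here)

VOCAB_SIZE = 32000        # hypothetical sub-word vocabulary size (not stated in the report)
D_MODEL = 512             # hypothetical model dimension (not stated in the report)

# Placeholder model: a stock PyTorch Transformer plus an output projection.
# The paper's pre-trained-model fusion is NOT reproduced here.
model = nn.Transformer(d_model=D_MODEL, dropout=DROPOUT, batch_first=True)
proj = nn.Linear(D_MODEL, VOCAB_SIZE)
optimizer = torch.optim.Adam(
    list(model.parameters()) + list(proj.parameters()), lr=LEARNING_RATE
)
criterion = nn.CrossEntropyLoss(label_smoothing=LABEL_SMOOTHING)


def train_step(micro_batches):
    """One optimizer update, accumulating gradients over ACCUM_STEPS micro-batches."""
    optimizer.zero_grad()
    for src, tgt, labels in micro_batches:   # tensors already truncated to MAX_SENT_LEN
        logits = proj(model(src, tgt))       # (batch, tgt_len, VOCAB_SIZE)
        loss = criterion(logits.reshape(-1, VOCAB_SIZE), labels.reshape(-1))
        (loss / ACCUM_STEPS).backward()      # average gradients over the accumulated steps
    optimizer.step()


# Toy usage with random embeddings; real training would feed sub-word token embeddings.
src = torch.randn(BATCH_SIZE, MAX_SENT_LEN, D_MODEL)
tgt = torch.randn(BATCH_SIZE, MAX_SENT_LEN, D_MODEL)
labels = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, MAX_SENT_LEN))
train_step([(src, tgt, labels)] * ACCUM_STEPS)
```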