Deep Fusing Pre-trained Models into Neural Machine Translation
Authors: Rongxiang Weng, Heng Yu, Weihua Luo, Min Zhang (pp. 11468-11476)
AAAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that our approach achieves considerable improvements on WMT14 En2De, WMT14 En2Fr, and WMT16 Ro2En translation benchmarks and outperforms previous work in both autoregressive and non-autoregressive NMT models. |
| Researcher Affiliation | Collaboration | Rongxiang Weng1,2, Heng Yu2, Weihua Luo2, Min Zhang1 1School of Computer Science and Technology, Soochow University, Suzhou, China 2Machine Intelligence Technology Lab, Alibaba Group, Hangzhou, China |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to its own source code, either through a specific repository link or an explicit code release statement. |
| Open Datasets | Yes | We evaluate our approach on three WMT translation tasks, including WMT14 English to German (En-De), WMT14 English to French (En-Fr) and WMT16 Romanian to English (Ro-En). Here, the En-De is the most widely used benchmark in machine translation. The En-Fr has the most training data of any public dataset. And the Ro-En could be treated as a low-resource language pair. Following previous work (Vaswani et al. 2017; Caswell, Chelba, and Grangier 2019), on the En-De task, the training set has about 4.5M sentence pairs. We use newstest2013 as validation set which has 3000 sentence pairs, and newstest2014 as test set which has 3003 sentence pairs. On the En-Fr task, our training set has about 36M sentence pairs. We use newstest2013 as validation set which has 3000 sentence pairs, and newstest2014 as test set which has 3003 sentence pairs. On the Ro-En task, our training set has about 0.6M sentence pairs. We use newstest2015 as validation set which has 2000 sentence pairs, and newstest2016 as test set which has 2000 sentence pairs. |
| Dataset Splits | Yes | We use newstest2013 as validation set which has 3000 sentence pairs, and newstest2014 as test set which has 3003 sentence pairs. On the En-Fr task, our training set has about 36M sentence pairs. We use newstest2013 as validation set which has 3000 sentence pairs, and newstest2014 as test set which has 3003 sentence pairs. On the Ro-En task, our training set has about 0.6M sentence pairs. We use newstest2015 as validation set which has 2000 sentence pairs, and newstest2016 as test set which has 2000 sentence pairs. (These splits are collected in the first sketch after the table.) |
| Hardware Specification | Yes | All experiments are conducted on 8 V100 GPUs, and we accumulate the gradient 4 iterations in En De and En Fr tasks. |
| Software Dependencies | No | The paper mentions software like "tensor2tensor", "cased multilingual BERT", "Adam", and "multi-bleu.perl" but does not provide specific version numbers for these software components used in their implementation. |
| Experiment Setup | Yes | We use label smoothing with the value 0.1 and dropout with the rate of 0.1. Adam (Kingma and Ba 2014) is used to update parameters, and the learning rate is set as 0.0001. The batch size is set as 64 and the max sentence length is limited to 80. All experiments are conducted on 8 V100 GPUs, and we accumulate the gradient for 4 iterations in the En-De and En-Fr tasks. After the training stage, we use beam search as the decoding algorithm, and the beam size is set as 4. (These settings are collected in the second sketch after the table.) |
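
The dataset sizes and splits quoted in the Open Datasets and Dataset Splits rows can be summarized as follows. This is a minimal sketch for reference only: the counts and set names come from the quoted text, while the dictionary layout and the `DATASET_SPLITS` name are illustrative and not from the paper.

```python
# Summary of the train/validation/test splits reported in the paper.
# Counts and newstest names are taken from the quoted excerpts above;
# the structure itself is an editorial convenience, not the authors' code.
DATASET_SPLITS = {
    "WMT14 En-De": {
        "train_pairs": 4_500_000,            # "about 4.5M sentence pairs"
        "validation": ("newstest2013", 3000),
        "test": ("newstest2014", 3003),
    },
    "WMT14 En-Fr": {
        "train_pairs": 36_000_000,           # "about 36M sentence pairs"
        "validation": ("newstest2013", 3000),
        "test": ("newstest2014", 3003),
    },
    "WMT16 Ro-En": {
        "train_pairs": 600_000,              # "about 0.6M sentence pairs"
        "validation": ("newstest2015", 2000),
        "test": ("newstest2016", 2000),
    },
}
```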
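
The training and decoding settings quoted in the Experiment Setup row can likewise be gathered into a single configuration sketch. All values are taken from the quoted text; the key names and the `TRAINING_CONFIG`/`DECODING_CONFIG` identifiers are assumptions for illustration and do not correspond to options of any specific toolkit (the paper's implementation is based on tensor2tensor, for which no code is released).

```python
# Training hyperparameters as reported in the paper (values only; key names are illustrative).
TRAINING_CONFIG = {
    "label_smoothing": 0.1,
    "dropout": 0.1,
    "optimizer": "Adam",                 # Kingma and Ba 2014
    "learning_rate": 1e-4,
    "batch_size": 64,
    "max_sentence_length": 80,
    "num_gpus": 8,                       # 8 V100 GPUs
    "gradient_accumulation_steps": 4,    # applied on the En-De and En-Fr tasks
}

# Decoding settings as reported in the paper.
DECODING_CONFIG = {
    "algorithm": "beam_search",
    "beam_size": 4,
}
```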