Incorporating BERT into Neural Machine Translation

Authors: Jinhua Zhu, Yingce Xia, Lijun Wu, Di He, Tao Qin, Wengang Zhou, Houqiang Li, Tie-Yan Liu

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on supervised (including sentence-level and document-level translations), semi-supervised and unsupervised machine translation, and achieve state-of-the-art results on seven benchmark datasets. Our code is available at https://github.com/bert-nmt/bert-nmt. We conduct 14 experiments on various NMT tasks to verify our approach, including supervised, semi-supervised and unsupervised settings. We conduct two groups of ablation studies on IWSLT'14 En-De translation to better understand our model.
Researcher Affiliation | Collaboration | Jinhua Zhu (1), Yingce Xia (2), Lijun Wu (3), Di He (4), Tao Qin (2), Wengang Zhou (1), Houqiang Li (1), Tie-Yan Liu (2). Affiliations: (1) CAS Key Laboratory of GIPAS, EEIS Department, University of Science and Technology of China; (2) Microsoft Research; (3) Sun Yat-sen University; (4) Key Laboratory of Machine Perception (MOE), School of EECS, Peking University
Pseudocode | No | No explicit pseudocode or algorithm blocks were found. The algorithm is described textually, with steps and mathematical formulations, in Section 4.1, but not in a structured pseudocode format. (A hedged sketch of the attention fusion described there is given after this table.)
Open Source Code | Yes | Our code is available at https://github.com/bert-nmt/bert-nmt.
Open Datasets | Yes | Dataset: For the low-resource scenario, we choose IWSLT'14 English-German (En-De), English-Spanish (En-Es), IWSLT'17 English-French (En-Fr) and English-Chinese (En-Zh) translation. There are 160k, 183k, 236k and 235k bilingual sentence pairs for the En-De, En-Es, En-Fr and En-Zh tasks. Following the common practice (Edunov et al., 2018), for En-De, we lowercase all words, split 7k sentence pairs from the training dataset for validation and concatenate dev2010, dev2012, tst2010, tst2011, tst2012 as the test set. For other tasks, we do not lowercase the words and use the official validation/test sets of the corresponding years. For the rich-resource scenario, we work on WMT'14 En-De and En-Fr, whose corpus sizes are 4.5M and 36M respectively. We concatenate newstest2012 and newstest2013 as the validation set and use newstest2014 as the test set. For unsupervised En-Fr, we use 190M monolingual English sentences and 62M monolingual French sentences from the WMT News Crawl datasets... For unsupervised En-Ro translation, we use 50M English sentences from News Crawl... and collect 2.9M sentences for Romanian by concatenating News Crawl data sets and WMT'16 Romanian monolingual data, following Lample et al. (2018).
Dataset Splits | Yes | For En-De, we lowercase all words, split 7k sentence pairs from the training dataset for validation and concatenate dev2010, dev2012, tst2010, tst2011, tst2012 as the test set. For other tasks, we do not lowercase the words and use the official validation/test sets of the corresponding years. We concatenate newstest2012 and newstest2013 as the validation set and use newstest2014 as the test set. We use newsdev2016 as validation set and newstest2016 as test set. (A sketch of the 7k validation hold-out is given after this table.)
Hardware Specification | Yes | Experiments on IWSLT and WMT tasks are conducted on 1 and 8 M40 GPUs respectively. We train models on 8 M40 GPUs, and the batch size is 2000 tokens per GPU.
Software Dependencies | No | No specific version numbers for ancillary software dependencies are provided. The paper mentions: “We mainly follow the scripts below to preprocess the data: https://github.com/pytorch/fairseq/tree/master/examples/translation”, “We leverage the pre-trained models provided by PyTorch-Transformers”, and evaluation tools such as multi-bleu.perl and sacreBLEU, but no version numbers for these software components are listed. (A sketch of loading a pre-trained BERT with PyTorch-Transformers is given after this table.)
Experiment Setup | Yes | The model configuration is transformer_iwslt_de_en, representing a six-layer model with embedding size 512 and FFN layer dimension 1024. For WMT'14 En-De and En-Fr, we use the transformer_big setting (short for transformer_vaswani_wmt_en_de_big) with dropout 0.3 and 0.1 respectively; in this setting, the aforementioned three parameters are 1024, 4096 and 6 respectively. The BERT models are fixed during training. The drop-net rate p_net is set to 1.0. We first train an NMT model until convergence, then initialize the encoder and decoder of the BERT-fused model with the obtained model; the BERT-encoder attention and BERT-decoder attention are randomly initialized. The batch size is 4k tokens per GPU. Following Ott et al. (2018), for WMT tasks, we accumulate the gradient for 16 iterations and then update, to simulate a 128-GPU environment. The optimization algorithm is Adam (Kingma & Ba, 2014) with initial learning rate 0.0005 and the inverse_sqrt learning rate scheduler (Vaswani et al., 2017). For WMT'14 En-De, we use beam search with width 4 and length penalty 0.6 for inference, following Vaswani et al. (2017). For other tasks, we use width 5 and length penalty 1.0. (Sketches of the learning-rate schedule and the effective batch arithmetic are given after this table.)
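
Pseudocode (Section 4.1): since the fusion of self-attention and BERT-encoder attention is only described in prose, the following is a minimal PyTorch-style sketch of one possible encoder layer under the drop-net trick with p_net = 1.0. The module structure, the number of attention heads, and the omitted residual connections and layer normalization are illustrative assumptions, not the authors' implementation from the released repository.

import random
import torch.nn as nn

class BertFusedEncoderLayerSketch(nn.Module):
    # Illustrative sketch only: fuses standard self-attention with attention
    # over fixed BERT features, combined by the drop-net trick.
    def __init__(self, d_model=512, n_heads=4, ffn_dim=1024, p_net=1.0):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads)  # head count is an assumption
        self.bert_attn = nn.MultiheadAttention(d_model, n_heads)  # queries from x, keys/values from BERT output
        self.ffn = nn.Sequential(nn.Linear(d_model, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, d_model))
        self.p_net = p_net

    def forward(self, x, bert_out):
        # x: (src_len, batch, d_model); bert_out: (bert_len, batch, d_model),
        # assumed already projected to d_model and kept fixed (no gradient to BERT).
        h_self, _ = self.self_attn(x, x, x)
        h_bert, _ = self.bert_attn(x, bert_out, bert_out)
        if self.training:
            u = random.random()
            if u < self.p_net / 2:            # keep only the self-attention branch
                h = h_self
            elif u > 1 - self.p_net / 2:      # keep only the BERT-attention branch
                h = h_bert
            else:                             # otherwise average the two branches
                h = 0.5 * (h_self + h_bert)
        else:
            h = 0.5 * (h_self + h_bert)       # inference always averages the two branches
        return self.ffn(h)

With p_net = 1.0 every training step uses exactly one of the two branches, a dropout-like regularization across the BERT and self-attention paths; residual connections, layer normalization and the analogous BERT-decoder attention are omitted for brevity.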
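
Dataset Splits: the 7k-sentence validation hold-out for IWSLT'14 En-De can be reproduced along the following lines. File names, paths and the random seed are hypothetical; the actual preprocessing follows the fairseq translation examples linked in the Software Dependencies row.

import random

# Hypothetical file names for the tokenized IWSLT'14 En-De training bitext.
with open("train.en-de.en") as f_src, open("train.en-de.de") as f_tgt:
    pairs = list(zip(f_src.readlines(), f_tgt.readlines()))

random.seed(0)                               # assumption: the paper does not state a seed
random.shuffle(pairs)
valid, train = pairs[:7000], pairs[7000:]    # hold out 7k pairs for validation

for name, split in (("valid", valid), ("train", train)):
    with open(f"{name}.en", "w") as f_src, open(f"{name}.de", "w") as f_tgt:
        for src, tgt in split:
            f_src.write(src)
            f_tgt.write(tgt)

The test set is then built separately by concatenating dev2010, dev2012, tst2010, tst2011 and tst2012, as quoted above.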
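
Software Dependencies: the pre-trained BERT encoders come from PyTorch-Transformers with no pinned version; a minimal sketch of loading a fixed (not fine-tuned) BERT with that library looks roughly as follows. The bert-base-uncased checkpoint name is an assumption; the paper selects checkpoints per language pair.

import torch
from pytorch_transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()                                  # the BERT model is fixed during NMT training

sentence = "Incorporating BERT into neural machine translation ."
input_ids = torch.tensor([tokenizer.encode(sentence)])
with torch.no_grad():
    bert_out = bert(input_ids)[0]            # last-layer hidden states, shape (1, seq_len, 768)

These hidden states are what the BERT-encoder and BERT-decoder attention modules attend to; the newer transformers package exposes the same from_pretrained interface if PyTorch-Transformers is unavailable.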
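
Experiment Setup: with 4k tokens per GPU on 8 GPUs and gradient accumulation over 16 iterations, each update aggregates 8 × 16 = 128 GPU-batches (roughly 512k tokens), which is the simulated 128-GPU environment mentioned above. The Adam learning rate follows the inverse_sqrt schedule of Vaswani et al. (2017); the sketch below assumes common fairseq warm-up defaults, since the row only states the peak learning rate of 0.0005.

def inverse_sqrt_lr(step, peak_lr=5e-4, warmup_updates=4000, warmup_init_lr=1e-7):
    # warmup_updates and warmup_init_lr are assumed values; only peak_lr (0.0005) is stated.
    if step < warmup_updates:
        # linear warm-up from warmup_init_lr to peak_lr
        return warmup_init_lr + (peak_lr - warmup_init_lr) * step / warmup_updates
    # afterwards, decay proportionally to the inverse square root of the update step
    return peak_lr * (warmup_updates / step) ** 0.5

for step in (1, 2000, 4000, 16000, 64000):
    print(step, round(inverse_sqrt_lr(step), 7))

At step 16000, for example, the rate is 0.0005 × sqrt(4000 / 16000) = 0.00025, i.e. half the peak value.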