Towards Making the Most of BERT in Neural Machine Translation

Authors: Jiacheng Yang, Mingxuan Wang, Hao Zhou, Chengqi Zhao, Weinan Zhang, Yong Yu, Lei Li

AAAI 2020, pp. 9378-9385 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments in machine translation show CTNMT gains of up to 3 BLEU score on the WMT14 English-German language pair which even surpasses the previous state-of-the-art pretraining aided NMT by 1.4 BLEU score.
Researcher Affiliation | Collaboration | Jiacheng Yang (1), Mingxuan Wang (2), Hao Zhou (2), Chengqi Zhao (2), Weinan Zhang (1), Yong Yu (1), Lei Li (2); (1) Shanghai Jiao Tong University, (2) ByteDance AI Lab, Beijing, China
Pseudocode | No | The paper describes its proposed methods (asymptotic distillation, dynamic switch, rate-scheduled learning) in narrative text and figures, but does not include any explicitly labeled pseudocode or algorithm blocks. (An illustrative sketch of these techniques is given after the table.)
Open Source Code | No | The paper points to external public resources, namely the BERT and GPT-2 models (footnotes 4 and 5) and the WMT datasets (footnotes 2 and 3), to support reproducibility. However, it provides no link to, or explicit statement about releasing, the source code for the CTNMT framework itself.
Open Datasets | Yes | All the training and testing datasets are public. For English-German, to compare with the results reported by previous work, we used the same subset of the WMT 2014 training corpus that contains 4.5M sentence pairs with 91M English words and 87M German words. ... http://www.statmt.org/wmt14/translation-task.html
Dataset Splits | Yes | The concatenation of news-test 2012 and news-test 2013 is used as the validation set and news-test 2014 as the test set (English-German). The concatenation of news-test 2012 and news-test 2013 serves as the validation set and news-test 2014 as the test set (English-French). We choose the WMT 2017 dataset as our development set and WMT 2018 as our test set (English-Chinese).
Hardware Specification | Yes | We train for 100,000 steps on 8 V100 GPUs, each of which results in a training batch containing approximately 8192 source and target tokens respectively.
Software Dependencies | No | The paper mentions using 'multi-bleu.pl' for evaluation and applying the public BERT and GPT-2 models as pre-trained LMs. However, it does not specify software dependencies with version numbers, such as the Python version or the deep learning framework (e.g., TensorFlow or PyTorch) and its version.
Experiment Setup | Yes | During training, we employ label smoothing of value ϵ = 0.1 (Pereyra et al. 2017). ... limited input and output tokens per batch to 8192 per GPU. We train our NMT model with sentences of length up to 150 words in the training data. We train for 100,000 steps on 8 V100 GPUs... We use a beam width of 8 and a length penalty of 0.6 in all the experiments. For our small model, the dimensions of all the hidden states were set to 768 and for the big model, the dimensions were set to 1024. ... For rate-scheduled learning in Eq. (10), the two schedule thresholds are set to 10,000 and 20,000 steps respectively (a sketch of such a schedule follows the table)...
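
The Pseudocode row notes that asymptotic distillation, the dynamic switch, and rate-scheduled learning are described only in prose and figures. The following minimal PyTorch-style sketch shows how the first two techniques could look; the module names, tensor shapes, gate parameterization, and the choice of an MSE distillation distance are assumptions drawn from the paper's narrative description, not the authors' implementation.

```python
# Hedged sketch of the dynamic switch and asymptotic distillation described
# in the paper. Names, shapes, and the MSE distance are illustrative
# assumptions, not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicSwitch(nn.Module):
    """Gated fusion of BERT states and NMT encoder states."""
    def __init__(self, d_model: int):
        super().__init__()
        self.w_bert = nn.Linear(d_model, d_model)
        self.w_nmt = nn.Linear(d_model, d_model)

    def forward(self, h_bert: torch.Tensor, h_nmt: torch.Tensor) -> torch.Tensor:
        # g in (0, 1) decides, per position and dimension, how much BERT
        # context to mix into the NMT encoder representation.
        g = torch.sigmoid(self.w_bert(h_bert) + self.w_nmt(h_nmt))
        return g * h_bert + (1.0 - g) * h_nmt

def asymptotic_distillation_loss(h_nmt: torch.Tensor,
                                 h_bert: torch.Tensor) -> torch.Tensor:
    # Pull the NMT encoder states toward the frozen BERT states; BERT is
    # detached so only the NMT parameters receive gradients.
    return F.mse_loss(h_nmt, h_bert.detach())

# Toy usage: batch of 2 sentences, 5 tokens, hidden size 768 (the small model).
h_bert = torch.randn(2, 5, 768)
h_nmt = torch.randn(2, 5, 768)
fused = DynamicSwitch(768)(h_bert, h_nmt)
loss = asymptotic_distillation_loss(h_nmt, h_bert)
print(fused.shape, loss.item())
```

The gate mixes BERT and NMT encoder states per position, while the distillation term pulls the NMT encoder toward the frozen BERT representations during training, matching the paper's high-level description of the two mechanisms.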
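The Experiment Setup row quotes two thresholds (10,000 and 20,000 steps) for the rate-scheduled learning of Eq. (10). The sketch below shows one plausible reading, a piecewise-linear coefficient that scales the learning rate of the pre-trained parameters between those thresholds; the exact functional form of Eq. (10) is not reproduced in the quote, so the shape and variable names here are assumptions.

```python
# Hedged sketch of a two-threshold rate schedule for the pre-trained (BERT)
# parameters. The thresholds 10,000 and 20,000 come from the quoted setup;
# the piecewise-linear shape is an assumption, not the paper's Eq. (10).
def rate_coefficient(step: int, t_low: int = 10_000, t_high: int = 20_000) -> float:
    if step < t_low:
        return 0.0                      # keep pre-trained weights effectively frozen
    if step >= t_high:
        return 1.0                      # full learning rate after the ramp
    return (step - t_low) / float(t_high - t_low)  # linear ramp in between

# The coefficient would scale the learning rate applied to the BERT parameters,
# e.g. lr_bert = rate_coefficient(step) * base_lr.
for s in (0, 10_000, 15_000, 20_000, 50_000):
    print(s, rate_coefficient(s))
```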