Joint Training for Neural Machine Translation Models with Monolingual Data
Authors: Zhirui Zhang, Shujie Liu, Mu Li, Ming Zhou, Enhong Chen
AAAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiment results on Chinese-English and English-German translation tasks show that our approach can simultaneously improve translation quality of source-to-target and target-to-source models, significantly outperforming strong baseline systems which are enhanced with monolingual data for model training including back-translation. |
| Researcher Affiliation | Collaboration | University of Science and Technology of China, Hefei, China; Microsoft Research |
| Pseudocode | Yes | Algorithm 1: Joint Training Algorithm for NMT (a hedged sketch of this loop appears after the table). |
| Open Source Code | No | The paper does not provide any specific links or explicit statements about the open-source availability of the code for the described methodology. |
| Open Datasets | Yes | For Chinese-English translation, we select our training data from LDC corpora, which consists of 2.6M sentence pairs with 65.1M Chinese words and 67.1M English words respectively. We use 8M Chinese sentences and 8M English sentences randomly extracted from the Xinhua portion of the Gigaword corpus as the monolingual data sets. Any sentence longer than 60 words is removed from training data (both the bilingual data and the pseudo bilingual data; a minimal length-filter sketch appears after the table). For English-German translation, we choose the WMT'14 training corpus used in Jean et al. (2015). |
| Dataset Splits | Yes | For Chinese-English, the NIST Open MT 2006 evaluation set is used as the validation set, and the NIST 2003, NIST 2005, NIST 2008, and NIST 2012 datasets as test sets. For English-German, the concatenation of news-test 2012 and news-test 2013 is used as the validation set and news-test 2014 as the test set. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or specific cloud instance types) used for running its experiments. |
| Software Dependencies | No | The paper mentions using 'RNNSearch model proposed by Bahdanau, Cho, and Bengio (2014)' and optimization with 'Adadelta (Zeiler 2012) algorithm', and 'Byte Pair Encoding (Sennrich, Haddow, and Birch 2016b)'. However, it does not specify version numbers for any software or libraries (e.g., Python, TensorFlow, PyTorch versions). |
| Experiment Setup | Yes | The size of word embedding (for both source and target words) is 256 and the size of hidden layer is set to 1024. The parameters are initialized using a normal distribution with a mean of 0 and a variance of 6/(d_row + d_col). Our models are optimized with the Adadelta (Zeiler 2012) algorithm with mini-batch size 128. We re-normalize the gradient if its norm is larger than 2.0 (Pascanu, Mikolov, and Bengio 2013). At test time, beam search with size 8 is employed to find the best translation, and translation probabilities are normalized by the length of the translation sentences. In practice, we first sort all monolingual data according to sentence length and then 64 sentences are simultaneously translated with a parallel decoding implementation. As for model training, we find that 4-5 EM iterations are enough to converge. (A hedged sketch of these optimization settings appears after the table.) |
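
The paper's Algorithm 1 alternately uses each translation direction to back-translate the other side's monolingual data into a pseudo-parallel corpus, then retrains both models on the union of true and pseudo bitext. Below is a minimal sketch of that outer loop; `bitext`, `mono_src`, `mono_tgt`, and the `train`/`translate` callables are hypothetical placeholders rather than the authors' implementation, and the paper's weighting of n-best pseudo translations by their normalized probabilities is omitted.

```python
# Hedged sketch of the joint training loop (Algorithm 1). All names are
# illustrative placeholders; probability weighting of n-best pseudo
# translations, used in the paper, is omitted for brevity.

def joint_training(s2t, t2s, bitext, mono_src, mono_tgt, train, translate,
                   iterations=5):
    """Jointly improve a source-to-target (s2t) and a target-to-source (t2s) model.

    bitext:   list of (src, tgt) sentence pairs
    mono_src: monolingual source sentences
    mono_tgt: monolingual target sentences
    train(model, pairs) and translate(model, sentence) are user-supplied.
    """
    # Warm-start both directions on the true bilingual data only.
    train(s2t, bitext)
    train(t2s, [(y, x) for (x, y) in bitext])

    for _ in range(iterations):  # the paper reports 4-5 EM iterations suffice
        # Each model back-translates the other side's monolingual data,
        # producing a pseudo-parallel corpus for its counterpart.
        pseudo_s2t = [(translate(t2s, y), y) for y in mono_tgt]  # (x', y) pairs
        pseudo_t2s = [(translate(s2t, x), x) for x in mono_src]  # (y', x) pairs

        # Retrain each direction on the true bitext plus its pseudo corpus.
        train(s2t, bitext + pseudo_s2t)
        train(t2s, [(y, x) for (x, y) in bitext] + pseudo_t2s)

    return s2t, t2s
```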
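The data preparation quoted above removes any training sentence longer than 60 words from both the bilingual and pseudo-bilingual data. A minimal filter under the assumption of whitespace tokenization (the paper only states the 60-word cut-off, not how words are counted):

```python
def filter_by_length(pairs, max_len=60):
    """Drop sentence pairs in which either side exceeds max_len words.

    `pairs` holds (source, target) strings; whitespace tokenization is an
    assumption, since the paper only specifies the 60-word limit.
    """
    return [
        (src, tgt)
        for src, tgt in pairs
        if len(src.split()) <= max_len and len(tgt.split()) <= max_len
    ]
```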
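The optimization details in the last row (normal initialization with variance 6/(d_row + d_col), Adadelta, mini-batches of 128, gradient re-normalization at norm 2.0) translate directly into a few lines of framework code. The sketch below uses PyTorch purely as an illustration; the paper does not name its framework, and `model`, `loss_fn`, and `batch` are placeholders.

```python
import math
import torch

def init_weight(d_row, d_col):
    """Normal(0, variance 6/(d_row + d_col)) initialization, as quoted above."""
    std = math.sqrt(6.0 / (d_row + d_col))
    return torch.nn.init.normal_(torch.empty(d_row, d_col), mean=0.0, std=std)

def train_step(model, loss_fn, batch, optimizer, max_grad_norm=2.0):
    """One update on a mini-batch (the paper uses mini-batch size 128)."""
    optimizer.zero_grad()
    loss = loss_fn(model, batch)
    loss.backward()
    # Re-normalize the gradient if its norm is larger than 2.0.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()

# The paper optimizes with Adadelta, e.g.:
# optimizer = torch.optim.Adadelta(model.parameters())
```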