DTMT: A Novel Deep Transition Architecture for Neural Machine Translation

Authors: Fandong Meng, Jinchao Zhang

AAAI 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that with the specially designed deep transition modules, our DTMT can achieve remarkable improvements on translation quality. Experimental results on the Chinese⇒English translation task show that DTMT can outperform the Transformer model by +2.09 BLEU points and achieve the best results ever reported on the same dataset. On WMT14 English⇒German and English⇒French translation tasks, DTMT shows superior quality to the state-of-the-art NMT systems, including the Transformer and the RNMT+.
Researcher Affiliation | Industry | Fandong Meng, Jinchao Zhang, WeChat AI Pattern Recognition Center, Tencent Inc. {fandongmeng, dayerzhang}@tencent.com
Pseudocode | No | The paper describes algorithms and models using mathematical formulas and textual descriptions (e.g., GRU, T-GRU, L-GRU), but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures. (A minimal illustrative sketch of these recurrent cells appears after this table.)
Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets | Yes | For Zh⇒En, the training data consists of 1.25M sentence pairs extracted from the LDC corpora. For En⇒De and En⇒Fr, we perform our experiments on the corpora provided by WMT14 that comprise 4.5M and 36M sentence pairs, respectively.
Dataset Splits | Yes | For Zh⇒En, we choose NIST 2006 (MT06) dataset as our valid set, and NIST 2002 (MT02), 2003 (MT03), 2004 (MT04), 2005 (MT05) and 2008 (MT08) datasets as our test sets. For En⇒De and En⇒Fr, we use newstest2013 as the valid set, and newstest2014 as the test set.
Hardware Specification | Yes | For Zh⇒En, we use 2 M40 GPUs for synchronous training and set lr0, p, s and e to 10^-3, 500, 8000, and 64000 respectively. For En⇒De, we use 8 M40 GPUs and set lr0, p, s and e to 10^-4, 50, 200000, and 1200000 respectively. For En⇒Fr, we use 8 M40 GPUs and set lr0, p, s and e to 10^-4, 50, 400000, and 3000000 respectively.
Software Dependencies | No | The paper mentions using the Adam optimizer and the multi-bleu.pl script, but it does not provide specific version numbers for these or any other software dependencies such as libraries or frameworks.
Experiment Setup | Yes | The parameters are initialized uniformly between [-0.08, 0.08] and updated by SGD with the learning rate controlled by the Adam optimizer (Kingma and Ba 2014) (β1 = 0.9, β2 = 0.999, and ϵ = 1e-6). We limit the length of sentences to 128 sub-words for Zh⇒En and 256 sub-words for En⇒De and En⇒Fr in the training stage. We batch sentence pairs according to the approximate length, and limit input and output tokens to 4096 per GPU. For Zh⇒En, we set dropout rates of the embedding layers, the layer before prediction and the RNN output layer to 0.5, 0.5 and 0.3 respectively. For each model of the translation tasks, the dimension of word embeddings and hidden layer is 1024. Translations are generated by beam search and log-likelihood scores are normalized by the sentence length. We set beam size = 4 and length penalty alpha = 0.6.
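
The paper specifies its recurrent cells only through equations and prose. Below is a minimal NumPy sketch of how the three cells named in the Pseudocode row might look: a vanilla GRU, a transition GRU (T-GRU) that takes no word-embedding input, and a linear-transformation-enhanced GRU (L-GRU) that adds a gated linear projection of the input to the candidate activation. All weight names, shapes, and the exact placement of the linear gate are assumptions for illustration, not the authors' released code.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, W):
    """Vanilla GRU step (sketch): gates depend on input x and state h."""
    z = sigmoid(W["xz"] @ x + W["hz"] @ h)          # update gate
    r = sigmoid(W["xr"] @ x + W["hr"] @ h)          # reset gate
    h_tilde = np.tanh(W["xh"] @ x + r * (W["hh"] @ h))
    return (1.0 - z) * h + z * h_tilde

def t_gru_cell(h, W):
    """Transition GRU (T-GRU) step: no external input, state-only gating,
    as described for the deep transition stack (sketch, not official code)."""
    z = sigmoid(W["hz"] @ h)
    r = sigmoid(W["hr"] @ h)
    h_tilde = np.tanh(r * (W["hh"] @ h))
    return (1.0 - z) * h + z * h_tilde

def l_gru_cell(x, h, W):
    """Linear-transformation-enhanced GRU (L-GRU) step: a gated linear
    projection of x is added to the candidate activation (sketch)."""
    z = sigmoid(W["xz"] @ x + W["hz"] @ h)
    r = sigmoid(W["xr"] @ x + W["hr"] @ h)
    l = sigmoid(W["xl"] @ x + W["hl"] @ h)          # linear-transformation gate
    h_tilde = np.tanh(W["xh"] @ x + r * (W["hh"] @ h)) + l * (W["xl2"] @ x)
    return (1.0 - z) * h + z * h_tilde

# Toy usage: one L-GRU step followed by two T-GRU transition steps, mimicking
# the deep transition pattern (L-GRU at the bottom, T-GRUs stacked above).
d_x, d_h = 8, 16
rng = np.random.default_rng(0)
W = {k: rng.normal(scale=0.08, size=(d_h, d_x if k.startswith("x") else d_h))
     for k in ["xz", "hz", "xr", "hr", "xl", "hl", "xh", "hh", "xl2"]}
x, h = rng.normal(size=d_x), np.zeros(d_h)
h = l_gru_cell(x, h, W)
for _ in range(2):
    h = t_gru_cell(h, W)
print(h.shape)  # (16,)

In DTMT, each encoder/decoder step chains one L-GRU at the bottom with several T-GRUs on top (the "deep transition"); the transition depth here is 2 only for illustration.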
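
To make the Experiment Setup row concrete, here is a hedged PyTorch-style sketch of the stated optimizer and initialization settings. The framework, the stand-in model, the vocabulary size, and the exact form of length normalization are assumptions (the paper only says log-likelihood scores are normalized by the sentence length); the real DTMT architecture and the lr0/p/s/e schedule are not reproduced.

import torch
import torch.nn as nn

# Hypothetical stand-in model; 1024-dim embeddings and hidden layers follow
# the Experiment Setup row, but this is NOT the DTMT architecture.
model = nn.Sequential(nn.Embedding(32000, 1024), nn.Linear(1024, 1024))

# Uniform initialization in [-0.08, 0.08], as stated in the paper.
for p in model.parameters():
    nn.init.uniform_(p, -0.08, 0.08)

# Adam with the reported hyper-parameters (beta1=0.9, beta2=0.999, eps=1e-6).
# lr0 differs per task (10^-3 for Zh=>En, 10^-4 for En=>De / En=>Fr); the
# warmup/decay schedule controlled by p, s and e is omitted here.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), eps=1e-6)

# Length-normalized beam scoring (beam size 4, alpha = 0.6). Dividing the
# summed log-probability by length**alpha is one common choice and is an
# assumption; the paper does not spell out the normalization formula.
def normalized_score(log_prob_sum: float, length: int, alpha: float = 0.6) -> float:
    return log_prob_sum / (length ** alpha)

print(normalized_score(-12.3, 10))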