Tied Transformers: Neural Machine Translation with Shared Encoder and Decoder

Authors: Yingce Xia, Tianyu He, Xu Tan, Fei Tian, Di He, Tao Qin

AAAI 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that such a simple method works well for both similar and dissimilar language pairs. We empirically verify our framework for both supervised and unsupervised NMT: we achieve a 35.52 BLEU score on IWSLT 2014 German to English translation, 28.98/29.89 BLEU scores on WMT 2014 English to German translation without/with monolingual data, and a 22.05 BLEU score on WMT 2016 unsupervised German to English translation.
Researcher Affiliation | Collaboration | ¹Microsoft Research, ²University of Science and Technology of China, ³Key Laboratory of Machine Perception, MOE, School of EECS, Peking University
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described in this paper.
Open Datasets | Yes | Datasets: For the IWSLT 2014 De-En, Es-En and Ro-En translation tasks, there are respectively 153k, 181k and 182k sentence pairs in each dataset¹. For WMT 2014 En-De and WMT 2016 En-Ro translation, there are 4.5M and 2.8M bilingual sentence pairs respectively. ... ¹All IWSLT 2014 training data can be found at https://wit3.fbk.eu/archive/2014-01/texts ... We collect 50M WMT monolingual data from newscrawl 2014 to newscrawl 2017.
Dataset Splits | Yes | For the other translation tasks, we do not lowercase the words; we use IWSLT14.TED.tst2013 as the validation sets and IWSLT14.TED.tst2014 as the test sets. For En-De translation, we concatenate newstest2012 and newstest2013 as the validation set and use newstest2014 as the test set. For En-Ro, we use newsdev2016 as the validation set and newstest2016 as the test set. For both the IWSLT and WMT translation tasks, all the datasets are preprocessed into word pieces following (Wu et al. 2016). (See the split-mapping sketch after the table.)
Hardware Specification | Yes | For the three IWSLT 2014 translation tasks, we train each model on two V100 GPUs for up to three days until convergence, and the minibatch size is fixed at 4096 tokens per GPU, including both the source-to-target and target-to-source data. For the two WMT translation tasks, we train each model on four P40 GPUs for ten days until convergence. ... The model is implemented in TensorFlow and trained on 8 M40 GPUs.
Software Dependencies | No | The paper mentions ‘TensorFlow’ and various scripts such as ‘multi-bleu.perl’, ‘sacreBLEU’ and ‘detokenizer.perl’, but does not provide specific version numbers for any software component. (See the scoring sketch after the table.)
Experiment Setup | Yes | Model Configurations: For the IWSLT 2014 translation tasks, we choose the transformer small setting with 8 blocks, where each block contains a self-attention layer, an optional encoder-to-decoder attention layer and a feedforward layer. The word embedding dimension, hidden state dimension, non-linear layer dimension and the number of heads are 256, 256, 1024 and 4 respectively. For WMT 2014 En-De translation and the WMT 2016 En-Ro task, we choose the transformer big setting, where the four numbers are 1024, 1024, 4096 and 16 respectively. The dropout rates for the two settings are 0.1 and 0.2. ... the minibatch size is fixed at 4096 tokens per GPU ... we apply the Adam optimizer with learning rate 0.0002 to optimize the network. ... In the inference phase, we use beam search with beam width 4 and set the length penalty α to 0.6. (These settings are summarized in the configuration sketch immediately after the table.)
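
The Experiment Setup row lists two Transformer configurations plus optimization and decoding settings. The sketch below collects those quoted numbers into plain Python dictionaries for quick reference; the dictionary layout and key names are mine, not the authors' code, and the block count for the big setting is not stated in the quoted excerpt.

```python
# Hyperparameters quoted in the Experiment Setup row, gathered into illustrative
# dictionaries. Key names are hypothetical and do not come from the paper's code.

TRANSFORMER_SMALL = {   # IWSLT 2014 tasks
    "num_blocks": 8,        # each block: self-attention, optional encoder-to-decoder attention, feedforward
    "embed_dim": 256,       # word embedding dimension
    "hidden_dim": 256,      # hidden state dimension
    "ffn_dim": 1024,        # non-linear (feedforward) layer dimension
    "num_heads": 4,
    "dropout": 0.1,
}

TRANSFORMER_BIG = {     # WMT 2014 En-De and WMT 2016 En-Ro tasks
    "embed_dim": 1024,
    "hidden_dim": 1024,
    "ffn_dim": 4096,
    "num_heads": 16,
    "dropout": 0.2,
    # Block count for this setting is not given in the quoted excerpt.
}

TRAINING = {
    "optimizer": "Adam",
    "learning_rate": 2e-4,     # 0.0002
    "tokens_per_gpu": 4096,    # minibatch size in tokens, both translation directions included
}

INFERENCE = {
    "beam_width": 4,
    "length_penalty_alpha": 0.6,
}
```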
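
The Dataset Splits row names a validation and a test set for each task. The mapping below restates those quoted set names so the split choices are easy to scan; the dictionary structure itself is illustrative only.

```python
# Validation/test sets per task as quoted in the Dataset Splits row.
# Only the set names come from the paper; the layout is illustrative.

SPLITS = {
    "IWSLT14 De-En / Es-En / Ro-En": {
        "valid": ["IWSLT14.TED.tst2013"],
        "test": ["IWSLT14.TED.tst2014"],
    },
    "WMT14 En-De": {
        "valid": ["newstest2012", "newstest2013"],  # concatenated
        "test": ["newstest2014"],
    },
    "WMT16 En-Ro": {
        "valid": ["newsdev2016"],
        "test": ["newstest2016"],
    },
}
```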
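
The Software Dependencies row notes that multi-bleu.perl, sacreBLEU and detokenizer.perl are mentioned without version numbers. For readers reproducing the scoring step, here is a minimal sketch using the sacrebleu Python package; the file names are placeholders and the snippet is not taken from the paper.

```python
import sacrebleu

# Placeholder file names: one detokenized hypothesis/reference per line.
with open("hypotheses.detok.txt", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("references.detok.txt", encoding="utf-8") as f:
    references = [line.strip() for line in f]

# corpus_bleu takes the system outputs and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```

Pinning and reporting the sacreBLEU version string alongside the score would address the missing-version issue flagged in that row.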