Regularizing Neural Machine Translation by Target-Bidirectional Agreement

Authors: Zhirui Zhang, Shuangzhi Wu, Shujie Liu, Mu Li, Ming Zhou, Tong Xu

AAAI 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experimental results show that our proposed method significantly outperforms state-of-the-art baselines on Chinese-English and English-German translation tasks."
Researcher Affiliation | Collaboration | University of Science and Technology of China, Hefei, China; Harbin Institute of Technology, Harbin, China; Microsoft Research Asia
Pseudocode | Yes | "Algorithm 1 Training Algorithm for L2R Model" (see the first sketch after this table)
Open Source Code | No | The paper mentions external tools and scripts (e.g., Moses scripts, TensorFlow tensor2tensor, SacreBLEU) but does not provide a link to, or a statement about, the availability of the authors' own source code for the proposed method.
Open Datasets | Yes | "For NIST OpenMT's Chinese-English translation task, we select our training data from LDC corpora, which consists of 2.6M sentence pairs..." and "For WMT17's English-German translation task, we use the pre-processed training data provided by the task organizers," with footnotes giving the specific LDC IDs and URLs.
Dataset Splits | Yes | "The NIST OpenMT 2006 evaluation set is used as validation set, and NIST 2003, 2005, 2008, 2012 datasets as test sets." and "We use the newstest2016 as the validation set and the newstest2017 as the test set."
Hardware Specification | Yes | "All models are trained on 4 Tesla M40 GPUs"
Software Dependencies | Yes | "The Transformer model (Vaswani et al. 2017) is adopted as our baseline. For all translation tasks, we follow the transformer_base_v2 hyper-parameter setting, which corresponds to a 6-layer Transformer with a model size of 512." (The footnote points to tensorflow/tensor2tensor/blob/v1.3.0/tensor2tensor/models/transformer.py.) The paper also mentions the Moses multi-bleu.perl script and the official SacreBLEU tool.
Experiment Setup | Yes | "For all translation tasks, we follow the transformer_base_v2 hyper-parameter setting..., which corresponds to a 6-layer Transformer with a model size of 512. The parameters are initialized using a normal distribution with a mean of 0 and a variance of 6/(d_row + d_col)... All models are trained on 4 Tesla M40 GPUs for a total of 100K steps using the Adam algorithm. The initial learning rate is set to 0.2 and decayed according to the schedule in Vaswani et al. (2017). During training, the batch size is set to approximately 4096 words per batch and checkpoints are created every 60 minutes. At test time, we use a beam of 8 and a length penalty of 1.0. Other hyper-parameters used in our approach are set as λ = 1, m = 1." (See the second sketch after this table.)
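The "Pseudocode" row refers to Algorithm 1, the paper's training algorithm for the L2R model regularized by agreement with an R2L model. The paper's exact formulation is not reproduced here; the following is a minimal NumPy sketch of what a target-bidirectional agreement loss of this kind could look like, assuming a standard cross-entropy term plus a KL agreement term weighted by λ (the summary above reports λ = 1), and assuming the R2L model's per-position distributions have already been re-aligned to the L2R token order and are treated as a fixed teacher. The function names, the KL direction, and the alignment convention are illustrative assumptions, not the paper's verified implementation.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over the vocabulary axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def agreement_regularized_loss(l2r_logits, r2l_logits_realigned, target_ids, lam=1.0):
    """Cross-entropy of the L2R model plus a KL agreement term (hypothetical sketch).

    l2r_logits:           (seq_len, vocab) logits from the left-to-right model.
    r2l_logits_realigned: (seq_len, vocab) logits from the right-to-left model,
                          re-aligned so position t refers to the same target token.
    target_ids:           (seq_len,) gold target token ids.
    lam:                  weight of the agreement term (lambda = 1 in the paper).
    """
    p_l2r = softmax(l2r_logits)
    p_r2l = softmax(r2l_logits_realigned)

    # Standard maximum-likelihood term for the L2R model.
    ce = -np.mean(np.log(p_l2r[np.arange(len(target_ids)), target_ids] + 1e-12))

    # Agreement term: KL(p_r2l || p_l2r), treating the R2L model as a fixed teacher.
    kl = np.mean(np.sum(p_r2l * (np.log(p_r2l + 1e-12) - np.log(p_l2r + 1e-12)), axis=-1))

    return ce + lam * kl
```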
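Two details quoted in the "Experiment Setup" row are directly programmable: the weight initialization (normal with mean 0 and variance 6/(d_row + d_col)) and the learning-rate schedule (initial rate 0.2, decayed as in Vaswani et al. 2017). The sketch below shows one way to reproduce them, assuming the "noam" schedule shape lr = factor · d_model^-0.5 · min(step^-0.5, step · warmup^-1.5) with a hypothetical warm-up of 8,000 steps; the warm-up length is not given in the summary above, and tensor2tensor's actual implementation may apply additional scaling constants.

```python
import numpy as np

def init_weight(d_row, d_col, seed=0):
    """Normal init with mean 0 and variance 6 / (d_row + d_col), per the setup quote."""
    rng = np.random.default_rng(seed)
    std = np.sqrt(6.0 / (d_row + d_col))
    return rng.normal(loc=0.0, scale=std, size=(d_row, d_col))

def noam_learning_rate(step, d_model=512, factor=0.2, warmup_steps=8000):
    """Schedule shape from Vaswani et al. (2017); factor and warm-up are assumptions here."""
    step = max(step, 1)
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Example: learning rate at a few points of the 100K-step training run.
for s in (1000, 8000, 100000):
    print(s, round(noam_learning_rate(s), 8))
```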