Dual Transfer Learning for Neural Machine Translation with Marginal Distribution Regularization
Authors: Yijun Wang, Yingce Xia, Li Zhao, Jiang Bian, Tao Qin, Guiquan Liu, Tie-Yan Liu
AAAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiment results on English→French and German→English tasks demonstrate that dual transfer learning achieves significant improvement over several strong baselines and obtains new state-of-the-art results. We conducted a set of experiments on two translation tasks to test the proposed method. |
| Researcher Affiliation | Collaboration | ¹Anhui Province Key Lab. of Big Data Analysis and Application, University of Science and Technology of China; ²University of Science and Technology of China; ³Microsoft Research Asia |
| Pseudocode | Yes | Algorithm 1: Dual transfer learning with marginal distribution regularization (a hedged sketch of this training objective appears after the table) |
| Open Source Code | No | The paper does not provide concrete access to its own source code, such as a specific repository link or an explicit statement of code release. The only code-related link is for a third-party tool used (BPE). |
| Open Datasets | Yes | For the English→French task, we used a subset of the bilingual corpus from WMT'14 for training, which contains 12M sentence pairs. For the German→English task, the bilingual corpus is from the IWSLT 2014 evaluation campaign (Cettolo et al. 2014), containing about 153k sentence pairs for training, and 7k/6.5k sentence pairs for validation/test. |
| Dataset Splits | Yes | We concatenated newstest2012 and newstest2013 as the validation set, and used newstest2014 as the test set. The validation and test sets for English→French contain 6k and 3k sentence pairs respectively. For the German→English task, the bilingual corpus is from the IWSLT 2014 evaluation campaign (Cettolo et al. 2014), containing about 153k sentence pairs for training, and 7k/6.5k sentence pairs for validation/test. |
| Hardware Specification | Yes | Models were optimized by AdaDelta (Zeiler 2012) on an M40 GPU until convergence. |
| Software Dependencies | No | The paper mentions optimizers like Adam and AdaDelta, but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | The sample size K and the hyperparameter λ in our method were set as 2 and 0.05 respectively according to the trade-off between validation performance and training time. To be specific, GRUs were applied as the recurrent units. The dimensions of word embedding and hidden state were 620 and 1000 respectively. We constructed the vocabulary with the most common 30K words in the parallel corpora. Out-of-vocabulary words were replaced with a special token UNK. For monolingual corpora, we removed the sentences containing out-of-vocabulary words. In order to prevent over-fitting, we applied dropout during training (Zaremba, Sutskever, and Vinyals 2014), where the dropout probability was 0.1. Gradient clipping was used with clipping value 1.0 and 2.5 for English→French and German→English respectively. (These reported settings are collected in the configuration sketch after the table.) |
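
The Pseudocode row cites the paper's Algorithm 1, "Dual transfer learning with marginal distribution regularization". Below is a minimal, framework-agnostic sketch of what such an objective might look like, assuming the regularizer penalizes the squared gap between a language-model estimate of log P(y) and an importance-sampled estimate of log Σₓ P(x)·P(y|x; θ) that uses the pre-trained dual (target→source) model as the proposal. All function names (`marginal_regularizer`, `training_step`, `log_p_y_lm`, `sample_from_dual`, etc.) are illustrative assumptions, not identifiers from the paper or any released code.

```python
import math


def marginal_regularizer(y, log_p_y_lm, sample_from_dual, log_p_dual,
                         log_p_x_lm, log_p_forward, K=2):
    """Squared gap between the language-model marginal log P(y) and an
    importance-sampled estimate of log sum_x P(x) * P(y|x; theta), with the
    dual (target->source) translation model as the proposal distribution.

    Every argument except `y` and `K` is a callable returning a
    log-probability; the names are ours, chosen for readability.
    """
    samples = [sample_from_dual(y) for _ in range(K)]  # x_k ~ P(x|y)
    # Each term is log( P_hat(x_k) * P(y|x_k; theta) / P(x_k|y) ).
    terms = [log_p_x_lm(x) + log_p_forward(x, y) - log_p_dual(x, y)
             for x in samples]
    # log of the average of the K importance weights (log-sum-exp for stability).
    m = max(terms)
    log_estimate = m + math.log(sum(math.exp(t - m) for t in terms)) - math.log(K)
    return (log_p_y_lm(y) - log_estimate) ** 2


def training_step(bilingual_batch, mono_batch, nll_forward, reg_fn, lam=0.05):
    """One objective evaluation: average MLE loss on bilingual pairs plus the
    weighted marginal-distribution regularizer on monolingual target
    sentences. Gradient computation and the parameter update are
    framework-specific and omitted here."""
    mle = sum(nll_forward(x, y) for x, y in bilingual_batch) / len(bilingual_batch)
    reg = sum(reg_fn(y) for y in mono_batch) / len(mono_batch)
    return mle + lam * reg
```

With K = 2 and λ = 0.05 as reported in the Experiment Setup row, each monolingual sentence contributes only two dual-model samples per update, which keeps the regularizer cheap relative to the maximum-likelihood term.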
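The Experiment Setup row scatters the reported hyperparameters across the quoted text; collecting them in one place makes the choices easier to scan. The dictionary below is only our summary of those reported values; the key names and the language-pair labels ("en-fr", "de-en") are ours and do not come from any released configuration file.

```python
# Hyperparameters quoted in the paper's experiment setup, gathered for reference.
EXPERIMENT_CONFIG = {
    "recurrent_unit": "GRU",
    "word_embedding_dim": 620,
    "hidden_state_dim": 1000,
    "vocabulary_size": 30_000,           # most frequent words; others mapped to UNK
    "dropout": 0.1,
    "optimizer": "AdaDelta",
    "sample_size_K": 2,                  # dual-model samples per monolingual sentence
    "lambda_regularization": 0.05,       # weight of the marginal-distribution term
    "gradient_clipping": {"en-fr": 1.0, "de-en": 2.5},
}
```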