Dual Transfer Learning for Neural Machine Translation with Marginal Distribution Regularization
Authors: Yijun Wang, Yingce Xia, Li Zhao, Jiang Bian, Tao Qin, Guiquan Liu, Tie-Yan Liu
AAAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiment results on English→French and German→English tasks demonstrate that dual transfer learning achieves significant improvement over several strong baselines and obtains new state-of-the-art results. We conducted a set of experiments on two translation tasks to test the proposed method. |
| Researcher Affiliation | Collaboration | ¹Anhui Province Key Lab. of Big Data Analysis and Application, University of Science and Technology of China; ²University of Science and Technology of China; ³Microsoft Research Asia |
| Pseudocode | Yes | Algorithm 1: Dual transfer learning with marginal distribution regularization (a hedged sketch of this training objective appears after the table) |
| Open Source Code | No | The paper does not provide concrete access to its own source code, such as a specific repository link or an explicit statement of code release. The only code-related link is for a third-party tool used (BPE). |
| Open Datasets | Yes | For the English→French task, we used a subset of the bilingual corpus from WMT'14 for training, which contains 12M sentence pairs. For the German→English task, the bilingual corpus is from the IWSLT 2014 evaluation campaign (Cettolo et al. 2014), containing about 153k sentence pairs for training, and 7k/6.5k sentence pairs for validation/test. |
| Dataset Splits | Yes | We concatenated newstest2012 and newstest2013 as the validation set, and used newstest2014 as the test set. The validation and test sets for English→French contain 6k and 3k sentence pairs respectively. For the German→English task, the bilingual corpus is from the IWSLT 2014 evaluation campaign (Cettolo et al. 2014), containing about 153k sentence pairs for training, and 7k/6.5k sentence pairs for validation/test. |
| Hardware Specification | Yes | Models were optimized by AdaDelta (Zeiler 2012) on an M40 GPU until convergence. |
| Software Dependencies | No | The paper mentions optimizers like Adam and AdaDelta, but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | The sample size K and the hyperparameter λ in our method were set as 2 and 0.05 respectively according to the trade-off between validation performance and training time. To be specific, GRUs were applied as the recurrent units. The dimensions of word embedding and hidden state were 620 and 1000 respectively. We constructed the vocabulary with the most common 30K words in the parallel corpora. Out-of-vocabulary words were replaced with a special token UNK. For monolingual corpora, we removed the sentences containing out-of-vocabulary words. In order to prevent over-fitting, we applied dropout during training (Zaremba, Sutskever, and Vinyals 2014), where the dropout probability was 0.1. Gradient clipping was used with clipping value 1.0 and 2.5 for English→French and German→English respectively. (These reported settings are collected in the configuration sketch after the table.) |
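
The Pseudocode row cites the paper's Algorithm 1, "Dual transfer learning with marginal distribution regularization". Below is a minimal, framework-agnostic sketch of what such an objective might look like, assuming the regularizer penalizes the squared gap between a language-model estimate of log P(y) and an importance-sampled estimate of log Σₓ P(x)·P(y|x; θ) that uses the pre-trained dual (target→source) model as the proposal. All function names (`marginal_regularizer`, `training_step`, `log_p_y_lm`, `sample_from_dual`, etc.) are illustrative assumptions, not identifiers from the paper or any released code.

```python
import math


def marginal_regularizer(y, log_p_y_lm, sample_from_dual, log_p_dual,
                         log_p_x_lm, log_p_forward, K=2):
    """Squared gap between the language-model marginal log P(y) and an
    importance-sampled estimate of log sum_x P(x) * P(y|x; theta), with the
    dual (target->source) translation model as the proposal distribution.

    Every argument except `y` and `K` is a callable returning a
    log-probability; the names are ours, chosen for readability.
    """
    samples = [sample_from_dual(y) for _ in range(K)]  # x_k ~ P(x|y)
    # Each term is log( P_hat(x_k) * P(y|x_k; theta) / P(x_k|y) ).
    terms = [log_p_x_lm(x) + log_p_forward(x, y) - log_p_dual(x, y)
             for x in samples]
    # log of the average of the K importance weights (log-sum-exp for stability).
    m = max(terms)
    log_estimate = m + math.log(sum(math.exp(t - m) for t in terms)) - math.log(K)
    return (log_p_y_lm(y) - log_estimate) ** 2


def training_step(bilingual_batch, mono_batch, nll_forward, reg_fn, lam=0.05):
    """One objective evaluation: average MLE loss on bilingual pairs plus the
    weighted marginal-distribution regularizer on monolingual target
    sentences. Gradient computation and the parameter update are
    framework-specific and omitted here."""
    mle = sum(nll_forward(x, y) for x, y in bilingual_batch) / len(bilingual_batch)
    reg = sum(reg_fn(y) for y in mono_batch) / len(mono_batch)
    return mle + lam * reg
```

With K = 2 and λ = 0.05 as reported in the Experiment Setup row, each monolingual sentence contributes only two dual-model samples per update, which keeps the regularizer cheap relative to the maximum-likelihood term.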
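The Experiment Setup row scatters the reported hyperparameters across the quoted text; collecting them in one place makes the choices easier to scan. The dictionary below is only our summary of those reported values; the key names and the language-pair labels ("en-fr", "de-en") are ours and do not come from any released configuration file.

```python
# Hyperparameters quoted in the paper's experiment setup, gathered for reference.
EXPERIMENT_CONFIG = {
    "recurrent_unit": "GRU",
    "word_embedding_dim": 620,
    "hidden_state_dim": 1000,
    "vocabulary_size": 30_000,           # most frequent words; others mapped to UNK
    "dropout": 0.1,
    "optimizer": "AdaDelta",
    "sample_size_K": 2,                  # dual-model samples per monolingual sentence
    "lambda_regularization": 0.05,       # weight of the marginal-distribution term
    "gradient_clipping": {"en-fr": 1.0, "de-en": 2.5},
}
```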