Dual Learning for Machine Translation
Authors: Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, Wei-Ying Ma
NeurIPS 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that dual-NMT works very well on English↔French translation; especially, by learning from monolingual data (with 10% bilingual data for warm start), it achieves a comparable accuracy to NMT trained from the full bilingual data for the French-to-English translation task. |
| Researcher Affiliation | Collaboration | (1) Key Laboratory of Machine Perception (MOE), School of EECS, Peking University; (2) University of Science and Technology of China; (3) Microsoft Research |
| Pseudocode | Yes | Algorithm 1: The dual-learning algorithm (a hedged sketch of this loop appears after the table) |
| Open Source Code | No | The paper states 'We leverage a tutorial NMT system implemented by Theano for all the experiments. dl4mt-tutorial: https://github.com/nyu-dl'. This link points to the tutorial system the authors built on, not to an open-source implementation of the dual-NMT method described in the paper. |
| Open Datasets | Yes | In detail, we used the same bilingual corpora from WMT 14 as used in [1, 5], which contains 12M sentence pairs extracted from five datasets: Europarl v7, Common Crawl corpus, UN corpus, News Commentary, and the 10^9 French-English corpus. ... We used the News Crawl: articles from 2012 provided by WMT 14 as monolingual data. |
| Dataset Splits | Yes | Following common practices, we concatenated newstest2012 and newstest2013 as the validation set, and used newstest2014 as the testing set. |
| Hardware Specification | Yes | Each of the baseline models was trained with AdaDelta [15] on a K40m GPU until its performance stopped improving on the validation set. |
| Software Dependencies | No | The paper mentions Theano (for the NMT implementation) and the AdaDelta optimizer, but does not provide version numbers for these or any other software components. |
| Experiment Setup | Yes | We used the GRU networks and followed the practice in [1] to set experimental parameters. ... Each word was projected into a continuous vector space of 620 dimensions, and the dimension of the recurrent unit was 1000. We removed sentences with more than 50 words from the training set. Batch size was set as 80 with 20 batches pre-fetched and sorted by sentence lengths. ... trained with AdaDelta [15]... We set the beam search size to be 2 in the middle translation process. ... during testing we used beam search [12] with beam size of 12 for all the algorithms as in many previous works. (These values are collected into a config sketch after the table.) |
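
The Pseudocode row refers to Algorithm 1, the two-agent dual-learning game between the A→B and B→A translation models. Below is a minimal Python sketch of one A→B round of that game, assuming hypothetical model interfaces (`translate_k`, `log_prob`, `score`, `policy_gradient_step`) that are not taken from the authors' code; the beam size of 2 for the middle translation step matches the setup quoted above, while the reward weight `alpha` is left as a free parameter.

```python
def dual_learning_round(sent_A, model_AB, model_BA, lm_B,
                        alpha=0.5, K=2, lr=1e-4):
    """One A->B round of the dual-learning game (sketch, not the authors' code).

    Assumed (hypothetical) interfaces:
      model_AB.translate_k(s, K)                 -> K beam-search translations of s
      model_XY.log_prob(src, tgt)                -> log P(tgt | src; theta_XY)
      lm_B.score(s)                              -> language-model reward for sentence s
      model_XY.policy_gradient_step(src, tgt, weight, lr)
          -> one gradient-ascent step on weight * log P(tgt | src; theta_XY)
    """
    # 1. Translate the monolingual sentence into language B
    #    (beam size 2 in the middle translation, as quoted in the table).
    mids = model_AB.translate_k(sent_A, K)

    for s_mid in mids:
        # 2. Language-model reward: how natural the translation reads in language B.
        r_lm = lm_B.score(s_mid)
        # 3. Communication (reconstruction) reward: can the reverse model
        #    recover the original sentence from the translation?
        r_comm = model_BA.log_prob(s_mid, sent_A)
        # 4. Combine the two rewards with weight alpha.
        r = alpha * r_lm + (1.0 - alpha) * r_comm
        # 5. REINFORCE-style updates: the forward model is reinforced with the
        #    total reward; the reverse model is updated on the reconstruction term.
        model_AB.policy_gradient_step(sent_A, s_mid, weight=r / K, lr=lr)
        model_BA.policy_gradient_step(s_mid, sent_A, weight=(1.0 - alpha) / K, lr=lr)
```

The symmetric B→A round is identical with the roles of the two translation models, language models, and monolingual corpora swapped.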
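
For convenience, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch; the keys are illustrative names, not identifiers from the authors' implementation.

```python
# Hyperparameters quoted in the Experiment Setup row (key names are illustrative).
EXPERIMENT_CONFIG = {
    "recurrent_unit": "GRU",
    "word_embedding_dim": 620,       # each word projected into a 620-dim space
    "recurrent_hidden_dim": 1000,    # dimension of the recurrent unit
    "max_train_sentence_len": 50,    # longer sentences removed from training
    "batch_size": 80,
    "prefetched_batches": 20,        # pre-fetched and sorted by sentence length
    "optimizer": "AdaDelta",
    "mid_translation_beam_size": 2,  # beam size in the middle translation step
    "test_beam_size": 12,            # beam search size at test time
}
```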