Generative Neural Machine Translation

Authors: Harshil Shah, David Barber

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this section we evaluate the effectiveness of GNMT and GNMT-MULTI on the 6 permutations of language pairs between English (EN), Spanish (ES) and French (FR), i.e. EN→ES, ES→EN, EN→FR, etc. We also train GNMT-MULTI in a semi-supervised manner, as described in section 2.6, and refer to this as GNMT-MULTI-SSL. We compare the performance of GNMT, GNMT-MULTI, and GNMT-MULTI-SSL against that of VNMT, which we believe to be the most closely related model to our work."
Researcher Affiliation | Collaboration | Harshil Shah¹ and David Barber¹,²,³ (¹University College London, ²Alan Turing Institute, ³reinfer.io)
Pseudocode | Yes | Algorithm 1: Generating translations; Algorithm 2: Translating when there are missing words (an illustrative sketch of the Algorithm 1 loop follows the table)
Open Source Code | No | No explicit statement or link for an open-source code release is provided.
Open Datasets | Yes | "We use paired data provided by the Multi UN corpus [Tiedemann, 2012]. [...] For the monolingual data used to train GNMT-MULTI-SSL, we use the News Crawl articles from 2009 to 2012, provided for the WMT 13 translation task."
Dataset Splits | Yes | "We train each model with a small, medium and large amount of paired data, corresponding to 40K, 400K and 4M paired sentences respectively. For each language pair, we create validation sets of size 5K and test sets of size 10K paired sentences respectively."
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) are provided for the experimental setup.
Software Dependencies | No | "We implement both models in Python, using the Theano [Theano Development Team, 2016] and Lasagne [Dieleman et al., 2015] libraries." (The libraries are named, but no specific version numbers are given.)
Experiment Setup | Yes | "The latent representation z has 100 units, each of the RNN hidden states has 1,000 units, and the word embeddings are 300-dimensional. [...] KL divergence annealing: We multiply the KL divergence term by a constant weight, which we linearly anneal from 0 to 1 over the first 50,000 iterations of training. [...] Word dropout: [...] This is parameterized by a drop rate, which we set to 30%." (A sketch of these two training tricks follows the table.)
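
The algorithms referenced in the Pseudocode row are iterative procedures that alternate between inferring the latent representation z and decoding a translation. The sketch below illustrates that generate-and-refine loop in plain Python; `posterior_mean` and `beam_search_decode` are hypothetical helper names standing in for the model's inference network and decoder, not the authors' implementation.

    # Illustrative sketch of an iterative generate-and-refine translation
    # loop in the spirit of the paper's Algorithm 1. `posterior_mean` and
    # `beam_search_decode` are hypothetical helpers, not the authors' code.

    def translate(source_tokens, model, max_iters=10):
        # With no translation yet, estimate z from the source sentence alone.
        z = model.posterior_mean(source=source_tokens, target=None)
        translation = model.beam_search_decode(z, source_tokens)
        for _ in range(max_iters):
            # Re-estimate z from both sentences, then re-decode; stop once
            # the translation no longer changes between iterations.
            z = model.posterior_mean(source=source_tokens, target=translation)
            new_translation = model.beam_search_decode(z, source_tokens)
            if new_translation == translation:
                break
            translation = new_translation
        return translation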
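
The KL annealing schedule and word dropout rate quoted in the Experiment Setup row can be stated concretely. Below is a minimal NumPy sketch (the paper used Theano/Lasagne; this version is framework-agnostic): the KL weight is annealed linearly over the first 50,000 iterations, and word dropout is shown as replacing a random 30% of tokens with an unknown token, a common variant from the VAE-for-text literature. The constant names and the `UNK_ID` value are illustrative assumptions, and the paper's exact dropout formulation may differ.

    import numpy as np

    KL_ANNEAL_ITERS = 50000  # KL weight annealed linearly from 0 to 1 (paper)
    WORD_DROP_RATE = 0.30    # word drop rate used in the paper
    UNK_ID = 0               # hypothetical id of the unknown-word token

    def kl_weight(iteration):
        # Linear KL annealing: 0 at iteration 0, reaching 1 at iteration
        # 50,000 and staying there for the rest of training.
        return min(1.0, iteration / KL_ANNEAL_ITERS)

    def word_dropout(token_ids, rng):
        # Replace a random fraction of tokens with the unknown token so the
        # decoder cannot rely purely on its own inputs and must use z
        # (one common variant; the paper's exact formulation may differ).
        token_ids = np.asarray(token_ids)
        mask = rng.random(token_ids.shape) < WORD_DROP_RATE
        return np.where(mask, UNK_ID, token_ids)

    # Usage: at training iteration t, the annealed ELBO would be computed as
    #   loss = reconstruction_loss + kl_weight(t) * kl_divergence
    rng = np.random.default_rng(0)
    dropped = word_dropout([5, 17, 42, 3, 99], rng)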