Neural Machine Translation with Gumbel-Greedy Decoding
Authors: Jiatao Gu, Daniel Jiwoong Im, Victor O.K. Li
AAAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We extensively evaluate the proposed GGD on large parallel corpora with different variants of generators and discriminators. The empirical results demonstrate that GGD improves translation quality. We consider translating the Czech-English (Cs-En) and German-English (De-En) language pairs in both directions with a standard attention-based neural machine translation system (Bahdanau, Cho, and Bengio 2014). |
| Researcher Affiliation | Collaboration | Jiatao Gu, Daniel Jiwoong Im, Victor O.K. Li; The University of Hong Kong, {jiataogu, wangyong, vli}@eee.hku.hk; AIFounded Inc., daniel.im@aifounded.com |
| Pseudocode | Yes | Algorithm 1: Gumbel-Greedy Decoding (a hedged sketch of the underlying sampling step appears after this table). |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code or include links to a code repository. |
| Open Datasets | Yes | We use the parallel corpora available from WMT'15 (http://www.statmt.org/wmt15/) as a training set. We use newstest-2013 for the validation set to select the best model according to the BLEU scores and use newstest-2015 for the test set. |
| Dataset Splits | Yes | We use newstest-2013 for the validation set to select the best model according to the BLEU scores and use newstest-2015 for the test set. All the datasets were tokenized and segmented into sub-word symbols using byte-pair encoding (BPE) (Sennrich, Haddow, and Birch 2015). We use sentences of length up to 50 subword symbols for teacher forcing and 80 symbols for REINFORCE, GGD and testing. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models or memory specifications used for running experiments. |
| Software Dependencies | No | The paper mentions several components and algorithms (e.g., GRU, Adadelta) but does not provide specific software names with version numbers for any libraries, frameworks, or environments used. |
| Experiment Setup | Yes | Our NMT model was trained with the teacher-forcing method (maximum likelihood) for approximately 300,000 updates for each language pair. These networks were trained using Adadelta (Zeiler 2012). Learning with RMSProp (Tieleman and Hinton 2012) is most effective with an initial learning rate of 1 × 10⁻⁵. The generator usually gets updated much more often than the discriminator; in our experiments, we used 10 updates of the generator for every discriminator's update. Four different temperature rates {5, 0.5, 0.05, 0.005} were used in the temperature experiment; all of our models were trained with a temperature of 0.5 in the other experiments. Beam search (size = 5) was used for decoding. A hedged sketch of the generator/discriminator update schedule appears after this table. |
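
The paper's Algorithm 1 (Gumbel-Greedy Decoding) builds on Gumbel-softmax sampling with a straight-through estimator, so that the discrete greedy choice in decoding still admits gradients. The snippet below is a minimal sketch of one such sampling step, not the authors' released code: the function name, the PyTorch framing, and the batch shapes are assumptions; only the temperature of 0.5 comes from the quoted setup.

```python
import torch
import torch.nn.functional as F

def gumbel_st_sample(logits, tau=0.5):
    """One Gumbel-softmax straight-through sampling step (a sketch, not the
    paper's exact Algorithm 1). tau=0.5 matches the temperature the paper
    reports using in most experiments."""
    # Sample Gumbel(0, 1) noise: g = -log(-log(u)), u ~ Uniform(0, 1)
    u = torch.rand_like(logits)
    gumbel = -torch.log(-torch.log(u + 1e-20) + 1e-20)
    perturbed = logits + gumbel
    # Hard (one-hot) choice: what the decoder actually emits / feeds forward
    hard = F.one_hot(perturbed.argmax(dim=-1), logits.size(-1)).float()
    # Soft relaxation: what the backward pass sees (straight-through trick)
    soft = F.softmax(perturbed / tau, dim=-1)
    return (hard - soft).detach() + soft

# Example: a batch of 2 fake decoder outputs over a 5-token vocabulary.
logits = torch.randn(2, 5, requires_grad=True)
y = gumbel_st_sample(logits)   # one-hot in the forward pass
y.sum().backward()             # gradients flow through the soft relaxation
```

The forward pass thus behaves like greedy/sampled decoding over discrete tokens, while the backward pass uses the temperature-controlled softmax as a surrogate.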
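
The quoted setup also pairs RMSProp (initial learning rate 1 × 10⁻⁵) with ten generator updates per discriminator update. Below is a minimal, self-contained sketch of that alternating schedule only; the tiny linear "generator" and "discriminator" and the placeholder losses are stand-ins, not the paper's NMT models or objectives.

```python
import torch
import torch.nn as nn

# Stand-in models, used solely to illustrate the update schedule.
generator = nn.Linear(8, 8)
discriminator = nn.Linear(8, 1)
gen_opt = torch.optim.RMSprop(generator.parameters(), lr=1e-5)
dis_opt = torch.optim.RMSprop(discriminator.parameters(), lr=1e-5)

for step in range(100):
    x = torch.randn(16, 8)  # placeholder batch
    # Generator step: push the discriminator to score generated outputs highly.
    gen_opt.zero_grad()
    g_loss = -discriminator(generator(x)).mean()
    g_loss.backward()
    gen_opt.step()
    # Discriminator step, taken once for every 10 generator steps.
    if step % 10 == 9:
        dis_opt.zero_grad()
        d_loss = (discriminator(generator(x).detach()).mean()
                  - discriminator(x).mean())
        d_loss.backward()
        dis_opt.step()
```

The 10:1 ratio reflects the paper's statement that the generator is updated far more often than the discriminator; everything else in the loop is illustrative.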