Decoding with Value Networks for Neural Machine Translation
Authors: Di He, Hanqing Lu, Yingce Xia, Tao Qin, Liwei Wang, Tie-Yan Liu
NeurIPS 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a set of experiments on several translation tasks. All the results demonstrate the effectiveness and robustness of the new decoding mechanism compared to several baseline algorithms. |
| Researcher Affiliation | Collaboration | 1) Key Laboratory of Machine Perception, MOE, School of EECS, Peking University; 2) Carnegie Mellon University; 3) University of Science and Technology of China; 4) Microsoft Research; 5) Center for Data Science, Peking University, Beijing Institute of Big Data Research |
| Pseudocode | Yes | The details of the decoding process are presented in Algorithm 2, and we call our neural network-based decoding algorithm NMT-VNN for short. (A hedged sketch of the combined decoding score follows the table.) |
| Open Source Code | No | The paper states that 'For NMT-BS, we directly used the open source code [2]', referring to a baseline's code, but it neither releases the authors' own NMT-VNN code nor states that it is publicly available. |
| Open Datasets | Yes | In detail, we used the same bilingual corpora from WMT'14 as used in [2], which contain 12M, 4.5M and 10M training data for each task. |
| Dataset Splits | Yes | Following common practices, for En→Fr and En→De, we concatenated newstest2012 and newstest2013 as the validation set, and used newstest2014 as the testing set. For Zh→En, we used the NIST 2006 and NIST 2008 datasets for testing, and used the NIST 2004 dataset for validation. |
| Hardware Specification | Yes | The NMT model was trained with asynchronous SGD on four K40m GPUs for about seven days. ...the value network model was trained with AdaDelta [21] on one K40m GPU for about three days. |
| Software Dependencies | No | The paper mentions 'AdaDelta [21]' as an optimization algorithm and the 'multi-bleu.perl' script for evaluation, but does not specify version numbers for general software dependencies such as the programming language or deep learning framework (e.g., Python 3.x, TensorFlow 2.x). |
| Experiment Setup | Yes | For each language, we constructed the vocabulary with the most common 30K words in the parallel corpora, and out-of-vocabulary words were replaced with a special token "UNK". Each word was embedded into a vector space of 620 dimensions, and the dimension of the recurrent unit was 1000. We removed sentences with more than 50 words from the training set. Batch size was set as 80 with 20 batches pre-fetched and sorted by sentence lengths. (A hedged preprocessing sketch follows the table.) |
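
The Pseudocode row refers to Algorithm 2 (NMT-VNN), which ranks beam-search candidates using both the NMT model's conditional probability and the value network's estimate of long-term reward. Below is a minimal Python sketch of one such combined scoring step; the function names (`rank_candidates`, `value_net`) and the specific convex combination with weight `alpha` are illustrative assumptions rather than the paper's exact formulation.

```python
import math
from typing import Callable, List, Tuple

def rank_candidates(
    candidates: List[Tuple[List[str], float]],  # each entry: (token sequence, accumulated log P(y|x))
    value_net: Callable[[List[str]], float],    # hypothetical scorer returning a value estimate in (0, 1]
    alpha: float = 0.85,                        # illustrative interpolation weight (assumed, not from the paper)
) -> List[Tuple[List[str], float]]:
    """Rank beam-search candidates by mixing the length-normalized
    log-probability with the log of the value-network estimate
    (a sketch of value-network-guided decoding; details assumed)."""
    scored = []
    for tokens, log_prob in candidates:
        norm_log_prob = log_prob / max(len(tokens), 1)  # length normalization
        value = value_net(tokens)                       # estimated long-term reward (e.g., expected BLEU)
        score = alpha * norm_log_prob + (1.0 - alpha) * math.log(value)
        scored.append((tokens, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

With a constant toy value network such as `lambda tokens: 0.5`, the ranking degenerates to ordinary length-normalized beam-search scoring, which makes the contribution of the value term easy to isolate when experimenting.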
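Similarly, the Experiment Setup row pins down the preprocessing pipeline (30K-word vocabulary, "UNK" for out-of-vocabulary words, removal of sentences longer than 50 words). The sketch below restates those steps in Python; function names such as `build_vocab` and `preprocess` are illustrative and not taken from the paper.

```python
from collections import Counter
from typing import List, Set

VOCAB_SIZE = 30_000   # most common 30K words per language
MAX_SENT_LEN = 50     # training sentences longer than 50 words are removed
UNK_TOKEN = "UNK"     # replacement token for out-of-vocabulary words
BATCH_SIZE = 80       # batches of 80, with 20 batches pre-fetched and sorted by length

def build_vocab(sentences: List[List[str]], size: int = VOCAB_SIZE) -> Set[str]:
    """Keep only the `size` most frequent words of the corpus."""
    counts = Counter(word for sent in sentences for word in sent)
    return {word for word, _ in counts.most_common(size)}

def preprocess(sentences: List[List[str]], vocab: Set[str]) -> List[List[str]]:
    """Drop over-length sentences and map out-of-vocabulary words to UNK."""
    kept = []
    for sent in sentences:
        if len(sent) > MAX_SENT_LEN:
            continue
        kept.append([word if word in vocab else UNK_TOKEN for word in sent])
    return kept
```

The embedding size (620) and recurrent-unit size (1000) quoted above are hyperparameters of the NMT model itself rather than of this preprocessing step.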