Deliberation Networks: Sequence Generation Beyond One-Pass Decoding
Authors: Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin, Tao Qin, Nenghai Yu, Tie-Yan Liu
NeurIPS 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on neural machine translation and text summarization demonstrate the effectiveness of the proposed deliberation networks. On the WMT 2014 English-to-French translation task, our model establishes a new state-of-the-art BLEU score of 41.5. |
| Researcher Affiliation | Collaboration | (1) University of Science and Technology of China, Hefei, China; (2) Microsoft Research, Beijing, China; (3) Sun Yat-sen University, Guangzhou, China |
| Pseudocode | Yes | Algorithm 1: Algorithm to train the deliberation network (see the illustrative sketch after this table) |
| Open Source Code | No | The paper does not provide any specific links or explicit statements about releasing the source code for its method. |
| Open Datasets | Yes | For En→Fr, we employ the standard filtered WMT'14 dataset... For Zh→En, we choose 1.25M bilingual sentence pairs from LDC dataset... The training, validation and test sets for the task are extracted from Gigaword Corpus [6] |
| Dataset Splits | Yes | We concatenate newstest2012 and newstest2013 together as the validation set and use newstest2014 as the test set. For Zh→En, we choose 1.25M bilingual sentence pairs from LDC dataset as training corpus, use NIST2003 as the validation set, and NIST2004, NIST2005, NIST2006, NIST2008 as the test sets. |
| Hardware Specification | Yes | All the models are trained on a single NVIDIA K40 GPU. |
| Software Dependencies | No | The paper mentions that the models are "implemented in Theano [24]" but does not specify a version number for Theano or other software dependencies. |
| Experiment Setup | Yes | The word embedding dimension is set as 620. For Zh→En, we apply 0.5 dropout rate to the layer before softmax and no dropout is used in En→Fr translation. ... Plain SGD is used as the optimizer in this process, with initial learning rate 0.2 and halving according to validation accuracy (see the schedule sketch after this table). To sample the intermediate translation output by the first decoder, we use beam search with beam size 2, considering the tradeoff between accuracy and efficiency. |
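
The Pseudocode row refers to the paper's Algorithm 1, which trains a two-pass model: a first decoder drafts an output, and a second decoder refines it while attending over both the source states and the draft states. The following PyTorch sketch only illustrates that structure; it is not the authors' Theano implementation. The GRU backbone, module names, dimensions other than the quoted 620-dimensional embedding, and the greedy draft decoding are all assumptions (the paper samples the draft with beam search, beam size 2).

```python
import torch
import torch.nn as nn

class DeliberationSketch(nn.Module):
    """Illustrative two-pass (deliberation) decoder; not the paper's code."""

    def __init__(self, vocab_size, emb_dim=620, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        # First-pass decoder attends over source states only.
        self.dec1 = nn.GRUCell(emb_dim + hid_dim, hid_dim)
        # Second-pass decoder attends over source AND first-pass states.
        self.dec2 = nn.GRUCell(emb_dim + 2 * hid_dim, hid_dim)
        self.out1 = nn.Linear(hid_dim, vocab_size)
        self.out2 = nn.Linear(hid_dim, vocab_size)

    def attend(self, query, keys):
        # Dot-product attention for brevity; the paper uses additive attention.
        scores = torch.bmm(keys, query.unsqueeze(2)).squeeze(2)   # (B, T)
        weights = torch.softmax(scores, dim=1)
        return torch.bmm(weights.unsqueeze(1), keys).squeeze(1)   # (B, H)

    def forward(self, src, max_len=20, bos_id=1):
        src_states, h = self.encoder(self.embed(src))
        h = h.squeeze(0)

        # Pass 1: decode a draft (greedy here; the paper uses beam size 2).
        y = torch.full((src.size(0),), bos_id, dtype=torch.long)
        draft_states, draft_logits = [], []
        for _ in range(max_len):
            ctx = self.attend(h, src_states)
            h = self.dec1(torch.cat([self.embed(y), ctx], dim=1), h)
            draft_states.append(h)
            logits = self.out1(h)
            draft_logits.append(logits)
            y = logits.argmax(dim=1)
        draft = torch.stack(draft_states, dim=1)

        # Pass 2: refine, attending over source and draft states jointly.
        h2 = draft[:, -1]
        y = torch.full((src.size(0),), bos_id, dtype=torch.long)
        final_logits = []
        for _ in range(max_len):
            src_ctx = self.attend(h2, src_states)
            drf_ctx = self.attend(h2, draft)
            inp = torch.cat([self.embed(y), src_ctx, drf_ctx], dim=1)
            h2 = self.dec2(inp, h2)
            final_logits.append(self.out2(h2))
            y = final_logits[-1].argmax(dim=1)
        return torch.stack(draft_logits, 1), torch.stack(final_logits, 1)

# Usage on dummy data: batch of 2 source sentences of length 7.
model = DeliberationSketch(vocab_size=30000)
draft_logits, final_logits = model(torch.randint(0, 30000, (2, 7)))
```

In training, Algorithm 1 optimizes both decoders, so a loss would be applied to both `draft_logits` and `final_logits`; the draft fed to the second pass comes from beam search rather than the greedy loop shown here.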
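
The optimizer schedule in the Experiment Setup row (plain SGD, initial learning rate 0.2, halved according to validation accuracy) maps naturally onto a plateau-based scheduler. Below is a minimal sketch, assuming PyTorch; the stand-in model, epoch count, and placeholder validation metric are not from the paper.

```python
import torch

# Stand-in model; only the quoted hyperparameters below come from the paper.
model = torch.nn.Linear(620, 620)

# Plain SGD with initial learning rate 0.2, as quoted in the table.
optimizer = torch.optim.SGD(model.parameters(), lr=0.2)

# "Halving according to validation accuracy" expressed as a plateau
# scheduler: halve the learning rate when accuracy stops improving.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=1)

for epoch in range(5):
    # ... one training epoch would run here ...
    val_accuracy = 0.0  # placeholder; use a real validation metric
    scheduler.step(val_accuracy)
    print(epoch, optimizer.param_groups[0]["lr"])
```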