Sequence Generation: From Both Sides to the Middle

Authors: Long Zhou, Jiajun Zhang, Chengqing Zong, Heng Yu

IJCAI 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on neural machine translation (En→De, Ch→En, and En→Ro) and text summarization show that the proposed model significantly speeds up decoding while improving generation quality compared to the autoregressive Transformer. "We extensively evaluate the proposed model on typical sequence generation tasks, namely neural machine translation and text summarization."
Researcher Affiliation | Collaboration | Long Zhou (1,2), Jiajun Zhang (1,2), Chengqing Zong (1,2,3), and Heng Yu (4). Affiliations: 1) University of Chinese Academy of Sciences, Beijing, China; 2) National Laboratory of Pattern Recognition, CASIA, Beijing, China; 3) CAS Center for Excellence in Brain Science and Intelligence Technology, Shanghai, China; 4) Machine Intelligence Technology Lab, Alibaba Group.
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about, or a link to, open-source code for its methodology. It mentions using the tensor2tensor toolkit, which is a third-party tool.
Open Datasets | Yes | We verify our model on three translation datasets of different sizes: WMT14 English-German (En→De), NIST Chinese-English (Ch→En), and WMT16 English-Romanian (En→Ro), whose training sets consist of 4.5M, 2.0M, and 0.6M sentence pairs, respectively. We conduct text summarization experiments on the English Gigaword dataset. The parallel corpus is produced by pairing the first sentence and the headline of each news article with some heuristic rules (a hypothetical pairing sketch is given below the table). The extracted corpus contains about 3.8M sentence-summary pairs for the training set and 189K examples for the development set. URLs provided: http://www.statmt.org/wmt14/translation-task.html, http://www.statmt.org/wmt16/translation-task.html, https://github.com/harvardnlp/sent-summary
Dataset Splits | Yes | For En→De, we use newstest2013 as the validation set and newstest2014 as the test set. For Ch→En, we utilize BPE to encode Chinese and English respectively, and limit the source and target vocabularies to the most frequent 30K tokens; we use NIST 2006 as the validation set and NIST 2003-2005 as the test sets. For En→Ro, we use newsdev2016 and newstest2016 as the development and test sets. For summarization, the extracted Gigaword corpus contains about 3.8M sentence-summary pairs for the training set and 189K examples for the development set. (The split assignments are collected in a sketch below the table.)
Hardware Specification | No | We use three GPUs to train En→De and one GPU for the other two language pairs. This statement is too general and does not specify GPU models or other hardware components.
Software Dependencies | No | We implement the proposed model based on the tensor2tensor toolkit. The paper mentions a software toolkit but does not provide specific version numbers for it or any other libraries.
Experiment Setup | Yes | For our bidirectional Transformer model, we employ the Adam optimizer with β1 = 0.9, β2 = 0.998, and ε = 10^-9. We use the same learning-rate warmup and decay strategy as Vaswani et al. [2017], with 16,000 warmup steps. During training, we employ label smoothing with ε_ls = 0.1. For evaluation, we use beam search with beam size k = 4 and length penalty α = 0.6. Besides, we use 6 encoder and decoder layers, a hidden size of 512, 8 attention heads, and a feed-forward inner-layer dimension of 2048. (See the configuration sketch below the table.)
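
The Gigaword corpus construction noted in the Open Datasets row (pairing each article's first sentence with its headline) can be illustrated with a minimal, hypothetical sketch. The function name and the naive sentence splitter below are assumptions; the actual heuristic filtering rules behind the released sent-summary data are not described in the paper.

```python
import re


def pair_first_sentence_with_headline(article: str, headline: str):
    """Hypothetical illustration of the Gigaword preprocessing idea:
    pair the first sentence of a news article with its headline.
    The real corpus at https://github.com/harvardnlp/sent-summary was
    built with additional heuristic rules not given in the paper."""
    # Naive sentence split on ., !, ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", article.strip())
    first_sentence = sentences[0] if sentences else ""
    return first_sentence, headline.strip()


if __name__ == "__main__":
    article = ("The central bank cut interest rates by half a point on Tuesday. "
               "Analysts said the move was widely expected.")
    headline = "central bank cuts rates by half a point"
    src, tgt = pair_first_sentence_with_headline(article, headline)
    print("source :", src)
    print("summary:", tgt)
```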
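
The split assignments quoted in the Dataset Splits row can also be collected into one structure for quick reference. The `DATASET_SPLITS` name and layout are illustrative assumptions, not part of the authors' setup; the values themselves come from the paper.

```python
# Split assignments as reported in the paper; dict layout and key names are
# illustrative assumptions, not taken from the authors' code.
DATASET_SPLITS = {
    "wmt14_en_de": {
        "train_pairs": "4.5M",
        "validation": "newstest2013",
        "test": "newstest2014",
    },
    "nist_ch_en": {
        "train_pairs": "2.0M",
        "validation": "NIST 2006",
        "test": ["NIST 2003", "NIST 2004", "NIST 2005"],
        "vocab": "BPE, 30K most frequent tokens per side",
    },
    "wmt16_en_ro": {
        "train_pairs": "0.6M",
        "validation": "newsdev2016",
        "test": "newstest2016",
    },
    "gigaword_summarization": {
        "train_pairs": "3.8M",
        "validation": "189K examples",
    },
}

if __name__ == "__main__":
    for name, splits in DATASET_SPLITS.items():
        print(name, splits)
```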
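
Finally, the hyperparameters quoted in the Experiment Setup row, together with the warmup-and-decay learning-rate schedule of Vaswani et al. [2017] that the paper reuses, can be sketched as follows. The `HPARAMS` dict and function names are illustrative assumptions; the authors used tensor2tensor, whose actual configuration objects differ.

```python
import math

# Model and training hyperparameters reported in the paper's experiment setup.
HPARAMS = {
    "num_layers": 6,            # encoder and decoder layers
    "hidden_size": 512,
    "num_heads": 8,
    "filter_size": 2048,        # feed-forward inner-layer dimension
    "adam_beta1": 0.9,
    "adam_beta2": 0.998,
    "adam_epsilon": 1e-9,
    "label_smoothing": 0.1,
    "warmup_steps": 16000,
    "beam_size": 4,
    "length_penalty_alpha": 0.6,
}


def noam_learning_rate(step: int,
                       hidden_size: int = HPARAMS["hidden_size"],
                       warmup_steps: int = HPARAMS["warmup_steps"]) -> float:
    """Warmup-then-decay schedule of Vaswani et al. (2017):
    lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    step = max(step, 1)  # avoid division by zero at step 0
    return hidden_size ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


if __name__ == "__main__":
    # Peak learning rate is reached at the end of warmup (step 16,000 here).
    for s in (1, 8000, 16000, 32000, 100000):
        print(f"step {s:>6}: lr = {noam_learning_rate(s):.6f}")
```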