Word Attention for Sequence to Sequence Text Understanding

Authors: Lijun Wu, Fei Tian, Li Zhao, Jianhuang Lai, Tie-Yan Liu

AAAI 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on abstractive summarization and neural machine translation show that word attention significantly improves over strong baselines. In particular, we achieve the state-of-the-art result on the WMT 14 English-French translation task with 12M training data. To evaluate our approach, we carried out experiments on two typical sequence to sequence text understanding tasks: abstractive summarization and neural machine translation.
Researcher Affiliation | Collaboration | Lijun Wu (1), Fei Tian (2), Li Zhao (2), Jianhuang Lai (1,3), Tie-Yan Liu (2). (1) School of Data and Computer Science, Sun Yat-sen University; (2) Microsoft Research; (3) Guangdong Key Laboratory of Information Security Technology. wulijun3@mail2.sysu.edu.cn; {fetia, lizo, tie-yan.liu}@microsoft.com; stsljh@mail.sysu.edu.cn
Pseudocode | No | The paper describes the model mathematically but does not include structured pseudocode or algorithm blocks (an illustrative sketch of a word-attention step is given after this table).
Open Source Code | No | The paper does not provide concrete access to its own source code, nor does it explicitly state that the code is publicly available.
Open Datasets | Yes | Abstractive Summarization: We train on the Gigaword corpus (Graff and Cieri 2003) and pre-process it identically to (Rush, Chopra, and Weston 2015; Shen et al. 2016), resulting in 3.8M training article-headline pairs, 190k for validation and 2,000 for test. Neural Machine Translation: For De-En, we use data from the De-En machine translation track of the IWSLT 2014 evaluation campaign (Cettolo et al. 2014)... For En-Fr, we use a widely adopted benchmark dataset (Jean et al. 2014; Zhou et al. 2016; Wang et al. 2017) which is a subset of the WMT 14 En-Fr training corpus, consisting of 12M sentence pairs.
Dataset Splits | Yes | Abstractive Summarization: We train on the Gigaword corpus (Graff and Cieri 2003) and pre-process it identically to (Rush, Chopra, and Weston 2015; Shen et al. 2016), resulting in 3.8M training article-headline pairs, 190k for validation and 2,000 for test. For De-En, ... the training/dev/test data set respectively contains about 153k/7k/7k De-En sentence pairs... For En-Fr, we use a widely adopted benchmark dataset ... newstest 2012 and newstest 2013 are concatenated as the dev set and newstest 2014 acts as the test set.
Hardware Specification | Yes | All our models are implemented with Theano (Theano Development Team 2016) and trained on TITAN Xp GPU. For summarization task, it takes about 1 day on one GPU; for De-En 2-layer model, it takes about 4 hours on one GPU; for En-Fr 4-4 layer model, the training takes roughly 17 days on 4 GPUs to converge, with batch size on each GPU as 32 and gradients on each GPU summed together via Nvidia NCCL. (A conceptual sketch of this gradient-summing scheme follows the table.)
Software Dependencies | Yes | All our models are implemented with Theano (Theano Development Team 2016).
Experiment Setup | Yes | The embedding size of our model is 620, and the LSTM hidden state size in both encoder and decoder is 1024. The initial values of all weight parameters are uniformly sampled between (-0.05, 0.05). We train our word attention enhanced model by Adadelta (Zeiler 2012) with learning rate 1.0 and gradient clipping threshold 1.5 (Pascanu, Mikolov, and Bengio 2013). The mini-batch size is 64 and the learning rate is halved when the dev performance stops increasing. For De-En, we use a single-layer LSTM model with the dimension of both embedding and hidden state to be 256. Similar to the summarization task, we also train the model by Adadelta with learning rate 1.0. The dropout rate is 0.15, the gradient is clipped by 2.5, and the batch size is 32. For En-Fr, we directly set the RNNsearch baseline as a 4-layer encoder and 4-layer decoder model and run our model on top of it, with embedding size 512, hidden state size 1024, and the dropout ratio 0.1. (These values are restated as a configuration sketch after the table.)
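
As noted in the Pseudocode row, the paper describes word attention only mathematically. The Python/NumPy snippet below is a minimal sketch of what one word-attention step could look like at a single decoder time step, attending directly over raw source word embeddings rather than encoder hidden states; the bilinear scoring function, the variable names, and the toy dimensions are illustrative assumptions, not the authors' exact formulation.

    import numpy as np

    def softmax(x):
        # Numerically stable softmax over the last axis.
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def word_attention_step(dec_state, src_embeddings, W):
        # dec_state:      (hidden,)      current decoder hidden state
        # src_embeddings: (src_len, emb) raw source word embeddings, not encoder states
        # W:              (hidden, emb)  bilinear scoring matrix (an assumption)
        scores = src_embeddings @ (W.T @ dec_state)   # (src_len,) one score per source word
        weights = softmax(scores)                     # attention distribution over source words
        context = weights @ src_embeddings            # (emb,) embedding-level context vector
        return context, weights

    # Toy usage with dimensions borrowed from the summarization setup (620/1024).
    rng = np.random.default_rng(0)
    dec_state = rng.normal(size=(1024,))
    src_embeddings = rng.normal(size=(20, 620))       # 20 source words
    W = 0.01 * rng.normal(size=(1024, 620))
    context, weights = word_attention_step(dec_state, src_embeddings, W)
    print(context.shape, weights.shape)               # (620,) (20,)

How such an embedding-level context would be fused with the standard attention over encoder hidden states is not specified in this table's quotes and is deliberately left out of the sketch.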
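
The Hardware Specification row states that, for the 4-GPU En-Fr run, per-GPU gradients are summed via Nvidia NCCL. The snippet below is only a conceptual illustration of that synchronous data-parallel pattern: each replica computes a gradient on its own shard, the gradients are summed, and one update is applied. It uses plain NumPy with a dummy least-squares loss rather than Theano or NCCL, and all names are placeholders.

    import numpy as np

    def per_replica_gradient(params, shard):
        # Stand-in for the gradient one GPU would compute on its data shard.
        # A dummy least-squares gradient keeps the sketch runnable.
        x, y = shard
        return 2.0 * x.T @ (x @ params - y) / len(y)

    def data_parallel_step(params, shards, lr=0.1):
        # Sum the per-replica gradients (the role NCCL plays in the paper's setup),
        # then apply a single synchronous parameter update.
        total_grad = sum(per_replica_gradient(params, s) for s in shards)
        return params - lr * total_grad

    # Toy usage: 4 "GPUs", each holding a shard of 32 examples (the per-GPU batch size cited).
    rng = np.random.default_rng(0)
    params = np.zeros(5)
    shards = [(rng.normal(size=(32, 5)), rng.normal(size=32)) for _ in range(4)]
    params = data_parallel_step(params, shards)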
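
The Experiment Setup row packs the hyperparameters for all three tasks into one quotation; restating them as a configuration sketch makes the per-task values easier to compare. The dictionary keys are illustrative, None marks a value the quote does not state, and the numbers are taken directly from the quoted text.

    # Hyperparameters quoted in the Experiment Setup row, restated per task.
    EXPERIMENT_SETUPS = {
        "gigaword_summarization": {
            "embedding_size": 620,
            "hidden_size": 1024,
            "init_uniform_range": (-0.05, 0.05),
            "optimizer": "Adadelta",
            "learning_rate": 1.0,
            "grad_clip": 1.5,
            "batch_size": 64,
            "lr_schedule": "halve when dev performance stops improving",
        },
        "iwslt14_de_en": {
            "embedding_size": 256,
            "hidden_size": 256,
            "optimizer": "Adadelta",
            "learning_rate": 1.0,
            "dropout": 0.15,
            "grad_clip": 2.5,
            "batch_size": 32,
            "encoder_layers": 1,        # "single-layer LSTM" in the Experiment Setup quote
            "decoder_layers": 1,
        },
        "wmt14_en_fr": {
            "embedding_size": 512,
            "hidden_size": 1024,
            "dropout": 0.1,
            "encoder_layers": 4,
            "decoder_layers": 4,
            "per_gpu_batch_size": 32,   # from the Hardware Specification row
            "num_gpus": 4,              # from the Hardware Specification row
            "optimizer": None,          # not stated in the quote
        },
    }

    for task, cfg in EXPERIMENT_SETUPS.items():
        print(task, cfg)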