DeepChannel: Salience Estimation by Contrastive Learning for Extractive Document Summarization

Authors: Jiaxin Shi, Chen Liang, Lei Hou, Juanzi Li, Zhiyuan Liu, Hanwang Zhang

AAAI 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In experiments, our model not only achieves state-of-the-art ROUGE scores on the CNN/Daily Mail dataset, but also shows strong robustness in the out-of-domain test on the DUC 2007 test set. Moreover, our model reaches a ROUGE-1 F-1 score of 39.41 on the CNN/Daily Mail test set with merely 1/100 of the training set, demonstrating tremendous data efficiency.
Researcher Affiliation | Academia | Jiaxin Shi¹, Chen Liang¹, Lei Hou¹, Juanzi Li¹, Zhiyuan Liu¹, Hanwang Zhang²; ¹Tsinghua University, ²Nanyang Technological University. {shijx12,lliangchenc}@gmail.com, {houlei,lijuanzi,liuzy}@tsinghua.edu.cn, hanwangzhang@ntu.edu.sg
Pseudocode | Yes | Algorithm 1 (Greedy Extraction Algorithm). Input: document D = {d1, d2, ..., d|D|}, a well-pretrained channel model P(D|S), expected summary length l. Output: optimal summary S. Procedure: S ← {}; while |S| < l: set d*, p* ← nil, 0; for each di ∈ D \ S, compute pi = P(D | S ∪ {di}) according to Formula 3, and if pi > p* then d*, p* ← di, pi; after the inner loop, S ← S ∪ {d*}. Finally, re-sort S based on the sentence order in D and return S. (See the Python sketch after this table.)
Open Source Code | Yes | The implementation is made publicly available at https://github.com/lliangchenc/DeepChannel
Open Datasets | Yes | We evaluate our model on two datasets: CNN/Daily Mail (Hermann et al. 2015; Nallapati et al. 2016; See, Liu, and Manning 2017; Hsu et al. 2018) and DUC 2007.
Dataset Splits | Yes | We follow (Hsu et al. 2018) and obtain the non-anonymized version of this dataset, which has 287,113 training pairs, 13,368 validation pairs, and 11,490 test pairs.
Hardware Specification | Yes | To obtain the results in Table 2, DeepChannel only needs to be trained for one epoch on the CNN/Daily Mail training set, taking about four hours on an Nvidia GTX 1080Ti GPU.
Software Dependencies | No | The paper mentions using GloVe word embeddings, the Adam optimizer, and GRUs, but does not provide version numbers for these or for any other software libraries or programming languages used.
Experiment Setup | Yes | For the model, we set the dimension of the word embedding to 300 and the GRU hidden dimension to 1024. We use a 3-layered MLP to calculate P(di|S) in Formula 2, consisting of 3 linear layers, 2 ReLU layers, and an output sigmoid layer. We use dropout (Srivastava et al. 2014) with probability 0.3 after the word embedding layer and before the first layer of the MLP. ... We use the Adam (Kingma and Ba 2014) optimizer with a fixed learning rate of 1e-5 to train our model. We set the weight of the penalization term α = 0.001. When extracting sentences, we fix the number of target sentences (i.e., l in Algorithm 1) to 3.
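The greedy extraction step quoted in the Pseudocode row can be written as a short routine. Below is a minimal Python sketch of Algorithm 1; the `channel_prob` callable standing in for the pretrained channel model P(D|S) (Formula 3) is a hypothetical placeholder, not the authors' released interface.

```python
def greedy_extract(document, channel_prob, summary_length=3):
    """Greedy extraction (Algorithm 1): pick sentences that maximize P(D|S).

    document: list of sentences in their original order.
    channel_prob: callable(document, summary) -> float; a hypothetical
        stand-in for the pretrained channel model P(D|S) from Formula 3.
    summary_length: expected number of summary sentences (l in Algorithm 1).
    """
    summary = []
    while len(summary) < summary_length:
        best_sentence, best_prob = None, 0.0
        for sentence in document:
            if sentence in summary:
                continue  # only consider sentences not yet selected
            prob = channel_prob(document, summary + [sentence])
            if prob > best_prob:
                best_sentence, best_prob = sentence, prob
        if best_sentence is None:
            break  # no candidate scored above zero; stop early
        summary.append(best_sentence)
    summary.sort(key=document.index)  # re-sort by original order in D
    return summary
```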
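The hyperparameters quoted in the Experiment Setup row translate directly into a model and optimizer configuration. The PyTorch sketch below only illustrates those settings (300-d word embeddings, 1024-d GRU, a 3-layer MLP with ReLU activations and a sigmoid output, dropout 0.3, Adam with learning rate 1e-5); the module structure, MLP layer widths, and vocabulary size are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the quoted hyperparameters; not the authors' code.
class SalienceScorer(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=1024, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)  # initialized from GloVe in the paper
        self.embed_dropout = nn.Dropout(dropout)               # dropout after the embedding layer
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.mlp = nn.Sequential(                              # dropout before the first MLP layer
            nn.Dropout(dropout),
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1), nn.Sigmoid(),            # score for P(d_i | S) in [0, 1]
        )

    def forward(self, sentence_ids, summary_ids):
        # Encode one document sentence and one candidate summary, then score the pair.
        _, sent_h = self.encoder(self.embed_dropout(self.embedding(sentence_ids)))
        _, summ_h = self.encoder(self.embed_dropout(self.embedding(summary_ids)))
        pair = torch.cat([sent_h[-1], summ_h[-1]], dim=-1)
        return self.mlp(pair).squeeze(-1)

model = SalienceScorer(vocab_size=50000)  # vocabulary size is an assumed value
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
```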