A Deep Reinforced Model for Abstractive Summarization

Authors: Romain Paulus, Caiming Xiong, Richard Socher

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate this model on the CNN/Daily Mail and New York Times datasets. Our model obtains a 41.16 ROUGE-1 score on the CNN/Daily Mail dataset, an improvement over previous state-of-the-art models. Human evaluation also shows that our model produces higher quality summaries. Our results for the CNN/Daily Mail dataset are shown in Table 1, and for the NYT dataset in Table 2.
Researcher Affiliation | Industry | Romain Paulus, Caiming Xiong & Richard Socher, Salesforce Research, 575 High Street, Palo Alto, CA 94301, USA. {rpaulus,cxiong,rsocher}@salesforce.com
Pseudocode | No | The paper provides mathematical equations and descriptions of its model and training procedures but does not include any pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any statement about releasing source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets | Yes | We evaluate our model on a modified version of the CNN/Daily Mail dataset (Hermann et al., 2015), following the same pre-processing steps described in Nallapati et al. (2016). The New York Times (NYT) dataset (Sandhaus, 2008) is a large collection of articles published between 1996 and 2007.
Dataset Splits | Yes | CNN/Daily Mail: The final dataset contains 287,113 training examples, 13,368 validation examples and 11,490 testing examples. NYT: We created our own training, validation, and testing splits for this dataset. Instead of producing random splits, we sorted the documents by their publication date in chronological order and used the first 90% (589,284 examples) for training, the next 5% (32,736) for validation, and the remaining 5% (32,739) for testing. (A chronological-split sketch follows the table.)
Hardware Specification | No | The paper does not specify the hardware used for running the experiments, only mentioning the use of LSTMs and the number of trainable parameters.
Software Dependencies | No | The paper mentions using the Adam optimizer and the Stanford tokenizer/NER, but does not provide version numbers for these or any other software dependencies required for reproducibility. It also mentions GloVe word embeddings, but these are pre-trained vectors rather than a versioned software dependency.
Experiment Setup | Yes | We use γ = 0.9984 for the ML+RL loss function. We use two 200-dimensional LSTMs for the bidirectional encoder and one 400-dimensional LSTM for the decoder. We limit the input vocabulary size to 150,000 tokens, and the output vocabulary to 50,000 tokens by selecting the most frequent tokens in the training set. Input word embeddings are 100-dimensional. We train all our models with Adam (Kingma & Ba, 2014) with a batch size of 50 and a learning rate α of 0.001 for ML training and 0.0001 for RL and ML+RL training. At test time, we use beam search of width 5 on all our models to generate our final predictions. (A configuration sketch follows the table.)
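
The NYT split quoted above is chronological rather than random. Below is a minimal sketch of such a split, assuming each example carries a publication date; the `pub_date` key is a hypothetical field name, not something the paper specifies.

```python
def chronological_split(examples, train_frac=0.90, val_frac=0.05):
    """Sort examples oldest-first and cut 90/5/5, mirroring the NYT
    split described in the paper (589,284 / 32,736 / 32,739)."""
    # `pub_date` is an assumed per-example field (hypothetical).
    ordered = sorted(examples, key=lambda ex: ex["pub_date"])
    n_train = int(len(ordered) * train_frac)
    n_val = int(len(ordered) * val_frac)
    train = ordered[:n_train]
    val = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:]
    return train, val, test

# Usage: train, val, test = chronological_split(nyt_documents)
```

Sorting by date before cutting means the test set contains only articles published after everything in the training set, which avoids temporal leakage between splits.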
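
The experiment-setup row pins down most of the reported hyperparameters. The sketch below collects them into a PyTorch skeleton; this is an illustrative reconstruction under stated assumptions, not the authors' implementation, and it omits the paper's intra-attention and pointer mechanism. The mixed objective follows the paper's formulation L_mixed = γ·L_rl + (1 − γ)·L_ml with γ = 0.9984.

```python
import torch
import torch.nn as nn

# Reported hyperparameters (Experiment Setup row above).
INPUT_VOCAB = 150_000   # most frequent tokens in the training set
OUTPUT_VOCAB = 50_000
EMB_DIM = 100           # input word embedding size
ENC_HIDDEN = 200        # per direction; bidirectional encoder -> 400 total
DEC_HIDDEN = 400
BATCH_SIZE = 50
BEAM_WIDTH = 5          # beam search width at test time
GAMMA = 0.9984          # ML+RL mixing weight

class Seq2SeqSkeleton(nn.Module):
    """Encoder/decoder skeleton matching the reported sizes; the paper's
    intra-attention, pointer, and embedding-sharing details are omitted."""
    def __init__(self):
        super().__init__()
        self.src_embed = nn.Embedding(INPUT_VOCAB, EMB_DIM)
        self.tgt_embed = nn.Embedding(OUTPUT_VOCAB, EMB_DIM)
        self.encoder = nn.LSTM(EMB_DIM, ENC_HIDDEN,
                               bidirectional=True, batch_first=True)
        self.decoder = nn.LSTM(EMB_DIM, DEC_HIDDEN, batch_first=True)
        self.project = nn.Linear(DEC_HIDDEN, OUTPUT_VOCAB)

model = Seq2SeqSkeleton()
# lr = 0.001 for ML training; 0.0001 for RL and ML+RL training.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

def mixed_loss(loss_rl, loss_ml, gamma=GAMMA):
    # L_mixed = gamma * L_rl + (1 - gamma) * L_ml
    return gamma * loss_rl + (1 - gamma) * loss_ml
```

The two 200-dimensional encoder directions concatenate to 400 dimensions, which is presumably why the decoder LSTM is 400-dimensional: encoder states and decoder states then share a common size.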