Exploring Explainable Selection to Control Abstractive Summarization

Authors: Haonan Wang, Yang Gao, Yu Bai, Mirella Lapata, Heyan Huang

AAAI 2021, pp. 13933-13941 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In a series of experiments assessed with ROUGE metrics and two human evaluations, ESCA outperformed eight state-of-the-art models on the CNN/Daily Mail and NYT50 benchmark datasets.
Researcher Affiliation | Academia | Haonan Wang (1), Yang Gao (1), Yu Bai (1), Mirella Lapata (2), Heyan Huang (1,3); (1) School of Computer Science and Technology, Beijing Institute of Technology; (2) Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh; (3) Beijing Engineering Research Center of High Volume Language Information Processing and Cloud Computing Applications
Pseudocode | No | The paper describes the model architecture and process in text and diagrams (Figure 1) but does not provide pseudocode or algorithm blocks.
Open Source Code | Yes | Our code and dataset samples are available on https://github.com/Wanghn95/Esca_Code
Open Datasets | Yes | We evaluated our models and baselines on two benchmark datasets, namely the CNN/Daily Mail news set (Hermann et al. 2015) and the New York Times Annotated Corpus (NYT) (Sandhaus 2008).
Dataset Splits | Yes | We followed the standard splits of 90,266/1,220/1,093 for the training, validation and testing sets of the CNN dataset and 196,961/12,148/10,397 for the Daily Mail dataset. (A dataset-loading sketch follows the table.)
Hardware Specification | No | The paper describes training configurations and model parameters but does not specify the hardware (e.g., GPU/CPU models, memory) used for the experiments.
Software Dependencies | Yes | All sentences were split with the Stanford CoreNLP toolkit (Manning et al. 2014). We used ROUGE as the evaluation metric (Lin 2004), implemented with the pyrouge package based on ROUGE 1.5.5; it measures the quality of a summary by computing the overlapping lexical elements between the candidate summary and a reference summary. Following previous practice, we assessed R-1 (unigram), R-2 (bigram) and R-L (longest common subsequence). We used the standard bert-base-uncased version of BERT. (A pyrouge evaluation sketch follows the table.)
Experiment Setup | Yes | ESCA-Transformer was trained with a 6-layer transformer with 8 heads. The hidden size was set to 512, and the feed-forward dimension for the multi-head attention was set to 1024. We used dropout with a probability of 0.2 before the linear layers. The pointer-generator was trained with a learning rate of 0.15 for both the encoder and the decoder, a batch size of 32 for the encoder, and a beam size of 4 for the decoder. At the testing phase, we limited the length of the summary to 120 words. The model was trained with early stopping and a length penalty imposed on the validation set. ESCA-BERT followed the settings specified by Liu and Lapata (2019). Specifically, we inserted [CLS] tokens at the start of each sentence and used two interval segment embeddings, E_A and E_B, to distinguish between multiple sentences in a document; each [CLS] token then learned the corresponding sentence embedding. Position embeddings in the BERT model had a 512-position limit. We used the standard bert-base-uncased version of BERT, and both the source and target tokens were tokenized with BERT's subword tokenizer. The hidden size of the transformer layers was 768, and all the feed-forward layers had 2,048 hidden units. One transformer layer in the extractor, with 8 heads and a dropout of 0.1, was dedicated to producing the sentence representations. We used the trigram-blocking trick (Paulus, Xiong, and Socher 2017) to prevent duplicates. The abstractor was trained for 15k iterations on the NYT dataset and 100k iterations on CNN/DM with a label-smoothing loss (Szegedy et al. 2016) at a factor of 0.1; dropout with a probability of 0.2 was again applied before the linear layers. The decoder contained 6 transformer layers. We used separate learning rates of 0.002 and 0.2 for the BERT encoder and the Transformer decoder, respectively. The settings for the decoding process were the same as those outlined for the Transformer-based model above. The final model contained 180M parameters. (Illustrative configuration sketches follow the table.)
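The sketches below are illustrative only and are not taken from the paper or the authors' repository. First, a minimal example of loading the CNN/Daily Mail benchmark referenced in the Open Datasets and Dataset Splits rows, assuming the Hugging Face datasets library (the paper does not name this tool); note that the Hugging Face release combines the CNN and Daily Mail portions, so the split sizes it prints differ from the per-source counts quoted above.

```python
# Minimal sketch (not from the paper): inspect the CNN/Daily Mail splits via the
# Hugging Face `datasets` library. The library choice is an assumption for illustration.
from datasets import load_dataset

cnn_dm = load_dataset("cnn_dailymail", "3.0.0")

for split in ("train", "validation", "test"):
    print(split, len(cnn_dm[split]))

# Each example has an "article" (source document) and "highlights" (reference summary).
example = cnn_dm["train"][0]
print(example["article"][:200])
print(example["highlights"])
```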
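Next, a minimal sketch of computing the ROUGE scores mentioned in the Software Dependencies row with the pyrouge wrapper around ROUGE 1.5.5. A local ROUGE-1.5.5 installation is required, and the directory names and filename patterns here are assumptions for illustration.

```python
# Minimal sketch: scoring generated summaries against references with pyrouge (ROUGE 1.5.5).
from pyrouge import Rouge155

r = Rouge155()  # expects ROUGE-1.5.5 to be installed and configured locally
r.system_dir = "outputs/system_summaries"     # hypothetical path: one generated summary per file
r.model_dir = "outputs/reference_summaries"   # hypothetical path: matching reference summaries
r.system_filename_pattern = r"doc.(\d+).txt"  # the group captures the document ID
r.model_filename_pattern = "doc.#ID#.txt"

output = r.convert_and_evaluate()
scores = r.output_to_dict(output)
# R-1, R-2 and R-L F-scores, as assessed in the paper.
print(scores["rouge_1_f_score"], scores["rouge_2_f_score"], scores["rouge_l_f_score"])
```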
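The Experiment Setup row reports the ESCA-Transformer encoder hyperparameters (6 layers, 8 heads, hidden size 512, feed-forward dimension 1024, dropout 0.2). Below is a minimal PyTorch sketch instantiating an encoder stack with those values; it is not the authors' implementation, and the vocabulary size and sequence length are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hyperparameters reported in the Experiment Setup row.
D_MODEL, N_HEADS = 512, 8
FFN_DIM, N_LAYERS = 1024, 6
DROPOUT = 0.2
VOCAB_SIZE = 30000  # assumption: the vocabulary size is not given in the quoted excerpt

embedding = nn.Embedding(VOCAB_SIZE, D_MODEL)
encoder_layer = nn.TransformerEncoderLayer(
    d_model=D_MODEL,
    nhead=N_HEADS,
    dim_feedforward=FFN_DIM,
    dropout=DROPOUT,
    batch_first=True,
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=N_LAYERS)

# Encode a dummy batch of 32 documents (the reported encoder batch size), 400 tokens each
# (the document length is an assumption).
tokens = torch.randint(0, VOCAB_SIZE, (32, 400))
hidden_states = encoder(embedding(tokens))
print(hidden_states.shape)  # torch.Size([32, 400, 512])
```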
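Finally, a sketch of the ESCA-BERT input construction described above (a [CLS] token per sentence and alternating interval segment embeddings E_A/E_B, following Liu and Lapata 2019), assuming the Hugging Face transformers tokenizer for bert-base-uncased. The example sentences are invented and the code is not the authors' preprocessing.

```python
# Minimal sketch (assumption, not the authors' code): BertSum-style input construction.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

sentences = [
    "The committee approved the new budget on Tuesday.",
    "Opposition members criticised the decision.",
]

tokens, segment_ids = [], []
for i, sent in enumerate(sentences):
    # Insert [CLS] before each sentence so its embedding can represent that sentence.
    piece = ["[CLS]"] + tokenizer.tokenize(sent) + ["[SEP]"]
    tokens.extend(piece)
    # Alternate interval segment embeddings (E_A / E_B) to separate adjacent sentences.
    segment_ids.extend([i % 2] * len(piece))

# Truncate to BERT's 512-position limit mentioned in the Experiment Setup row.
input_ids = tokenizer.convert_tokens_to_ids(tokens)[:512]
segment_ids = segment_ids[:512]
print(len(input_ids), input_ids[:10])
```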