PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization

Authors: Jingqing Zhang, Yao Zhao, Mohammad Saleh, Peter J. Liu

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluated our best PEGASUS model on 12 downstream summarization tasks spanning news, science, stories, instructions, emails, patents, and legislative bills. Experiments demonstrate it achieves state-of-the-art performance on all 12 downstream datasets measured by ROUGE scores. (See the ROUGE scoring sketch after the table.)
Researcher Affiliation | Collaboration | 1 Data Science Institute, Imperial College London, London, UK; 2 Brain Team, Google Research, Mountain View, CA, USA.
Pseudocode | Yes | Algorithm 1: Sequential Sentence Selection. (See the selection sketch after the table.)
Open Source Code | Yes | The training code and instructions for using model checkpoints can be found at https://github.com/google-research/pegasus
Open Datasets | Yes | For downstream summarization, we only used public abstractive summarization datasets and accessed them through TensorFlow Summarization Datasets (https://www.tensorflow.org/datasets/catalog/overview), which provides publicly reproducible code for dataset processing and train/validation/test splits. (See the dataset-loading sketch after the table.)
Dataset Splits | Yes | We used a train/validation/test ratio of 80/10/10 if no split was provided, and a 10% train split as validation if there was no validation split.
Hardware Specification | No | The paper describes model architectures (e.g., layers, hidden size) but does not specify the type or model of hardware (e.g., GPU, CPU, TPU) used for training or experimentation.
Software Dependencies | No | The paper mentions software components such as Adafactor, byte-pair encoding (BPE), and SentencePiece unigram, but does not provide specific version numbers for these or other libraries used in the experiments.
Experiment Setup | Yes | We pre-trained PEGASUS_BASE with a batch size of 256 and PEGASUS_LARGE with a batch size of 8192. We used sinusoidal positional encoding following Vaswani et al. (2017). For optimization, both pre-training and fine-tuning used Adafactor (Shazeer & Stern, 2018) with square-root learning-rate decay and a dropout rate of 0.1. We used greedy decoding for the studies in Section 6.1, and beam search with a length penalty, α, as in Wu et al. (2016) for the final large model. All experiment hyperparameters can be found in Appendix C and reported numbers are in Appendix D and E. PEGASUS_BASE had L = 12, H = 768, F = 3072, A = 12, and PEGASUS_LARGE had L = 16, H = 1024, F = 4096, A = 16. (See the configuration sketch after the table.)
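The Research Type row reports ROUGE as the headline metric across the 12 downstream datasets. Below is a minimal scoring sketch, assuming the open-source rouge-score package; the reference/prediction pair is made up for illustration, and the paper's full evaluation pipeline is not reproduced here.

```python
# Minimal ROUGE scoring sketch (assumes the `rouge-score` pip package).
# The reference/prediction pair below is illustrative only.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    target="pegasus achieves state-of-the-art results on 12 summarization datasets .",
    prediction="pegasus reaches state-of-the-art performance on twelve summarization benchmarks .",
)
# Each entry carries precision, recall, and fmeasure; papers typically report the F-measure.
print({name: round(s.fmeasure, 4) for name, s in scores.items()})
```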
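The Pseudocode row cites Algorithm 1 (Sequential Sentence Selection), the greedy loop that picks the gap sentences which best summarize the rest of the document. The sketch below follows that loop but substitutes a simplified unigram-overlap F1 for the paper's ROUGE1-F1 scorer, so the helper names and scoring details are illustrative rather than the authors' implementation.

```python
# Sketch of greedy sequential gap-sentence selection, in the spirit of Algorithm 1.
# `unigram_f1` stands in for the ROUGE1-F1 scorer used in the paper.
from collections import Counter
from typing import List

def unigram_f1(summary: str, reference: str) -> float:
    """F1 overlap of whitespace-token multisets (a crude ROUGE-1 proxy)."""
    s, r = Counter(summary.split()), Counter(reference.split())
    overlap = sum((s & r).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(s.values()), overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def select_gap_sentences(sentences: List[str], m: int) -> List[int]:
    """Greedily pick m sentence indices that best 'summarize' the remainder."""
    selected: List[int] = []
    for _ in range(min(m, len(sentences))):
        best_i, best_score = None, -1.0
        for i in range(len(sentences)):
            if i in selected:
                continue
            candidate = selected + [i]
            pseudo_summary = " ".join(sentences[j] for j in candidate)
            remainder = " ".join(s for j, s in enumerate(sentences) if j not in candidate)
            score = unigram_f1(pseudo_summary, remainder)
            if score > best_score:
                best_i, best_score = i, score
        selected.append(best_i)
    return sorted(selected)

# Example: mask 2 of 4 toy sentences as the pre-training "summary" targets.
doc = ["Pegasus is a pre-training method .", "It masks whole sentences .",
       "The masked sentences form the target .", "ROUGE picks which ones to mask ."]
print(select_gap_sentences(doc, m=2))
```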
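The Open Datasets and Dataset Splits rows point to TensorFlow Datasets and the 80/10/10 fallback split rule. The sketch below loads one public summarization dataset (cnn_dailymail, chosen only as an example) and shows how TFDS subsplit slicing could derive an 80/10/10 split for a dataset that ships with a single train split; it is not the paper's data pipeline.

```python
# Sketch of accessing a public summarization dataset through TensorFlow Datasets.
# cnn_dailymail is used as an example; it triggers a sizable download on first use.
import tensorflow_datasets as tfds

# Datasets that provide official splits can be loaded directly.
train_ds, val_ds, test_ds = tfds.load(
    "cnn_dailymail", split=["train", "validation", "test"]
)

# For a dataset with only a "train" split, an 80/10/10 split can be derived
# with TFDS subsplit slicing (mirroring the fallback rule quoted above).
train80, val10, test10 = tfds.load(
    "cnn_dailymail", split=["train[:80%]", "train[80%:90%]", "train[90%:]"]
)

# Peek at one (article, highlights) pair.
for example in train_ds.take(1):
    print(example["article"].numpy()[:200])
    print(example["highlights"].numpy()[:200])
```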
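The Experiment Setup row gives the reported architecture sizes and notes the use of sinusoidal positional encodings. The sketch below collects those numbers in plain Python and implements the standard Vaswani et al. (2017) encoding; it is a configuration summary under stated assumptions, not the authors' training code (the Adafactor schedule and decoding settings are omitted).

```python
# Reported PEGASUS configurations (L = layers, H = hidden size, F = feed-forward
# size, A = attention heads) plus the standard sinusoidal positional encoding.
import numpy as np

PEGASUS_BASE = dict(L=12, H=768, F=3072, A=12, pretrain_batch_size=256)
PEGASUS_LARGE = dict(L=16, H=1024, F=4096, A=16, pretrain_batch_size=8192)
DROPOUT_RATE = 0.1  # reported for both pre-training and fine-tuning

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Fixed encodings from Vaswani et al. (2017): sin on even dims, cos on odd dims."""
    positions = np.arange(max_len)[:, None]          # (max_len, 1)
    dims = np.arange(d_model)[None, :]               # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                 # (max_len, d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

pe = sinusoidal_positional_encoding(max_len=512, d_model=PEGASUS_BASE["H"])
print(pe.shape)  # (512, 768)
```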