Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization
Authors: Jingqing Zhang, Yao Zhao, Mohammad Saleh, Peter Liu
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluated our best PEGASUS model on 12 downstream summarization tasks spanning news, science, stories, instructions, emails, patents, and legislative bills. Experiments demonstrate it achieves state-of-the-art performance on all 12 downstream datasets measured by ROUGE scores. |
| Researcher Affiliation | Collaboration | ¹Data Science Institute, Imperial College London, London, UK; ²Brain Team, Google Research, Mountain View, CA, USA. |
| Pseudocode | Yes | Algorithm 1 Sequential Sentence Selection |
| Open Source Code | Yes | The training code and instructions for using model checkpoints can be found at https://github.com/google-research/pegasus |
| Open Datasets | Yes | For downstream summarization, we only used public abstractive summarization datasets, and access them through TensorFlow Summarization Datasets¹, which provides publicly reproducible code for dataset processing and train/validation/test splits. ¹https://www.tensorflow.org/datasets/catalog/overview |
| Dataset Splits | Yes | We used train/validation/test ratio of 80/10/10 if no split was provided, and 10% train split as validation if there was no validation split. |
| Hardware Specification | No | The paper describes model architectures (e.g., layers, hidden size), but does not specify the type or model of hardware (e.g., GPU, CPU, TPU) used for training or experimentation. |
| Software Dependencies | No | The paper mentions software components like Adafactor, Byte-pair encoding (BPE), and SentencePiece Unigram, but does not provide specific version numbers for these or other libraries used for the experiments. |
| Experiment Setup | Yes | We pre-trained PEGASUS_BASE with a batch size of 256 and PEGASUS_LARGE with a batch size of 8192. We used sinusoidal positional encoding following Vaswani et al. (2017). For optimization, both pre-training and finetuning used Adafactor (Shazeer & Stern, 2018) with square root learning rate decay and dropout rate of 0.1. We used greedy-decoding for studies in Section 6.1, and used beam-search with a length-penalty, α, as in Wu et al. (2016) for the final large model. All experiments' hyperparameters can be found in Appendix C and reported numbers are in Appendix D and E. PEGASUS_BASE had L = 12, H = 768, F = 3072, A = 12 and PEGASUS_LARGE had L = 16, H = 1024, F = 4096, A = 16. |
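
The split rule quoted under Dataset Splits (80/10/10 when no split is provided, and 10% of train held out when validation is missing) can be sketched as below. This is only an illustration, not the authors' code. The helper name `resolve_splits`, the dict-of-lists representation, and the contiguous, unshuffled slicing are all assumptions; the paper states that its actual dataset processing goes through TensorFlow Datasets.

```python
from typing import Dict, List, Sequence


def resolve_splits(splits: Dict[str, Sequence]) -> Dict[str, List]:
    """Hypothetical helper illustrating the quoted split rule.

    - A single pool of examples (no split provided): 80/10/10 train/validation/test.
    - Train present but validation missing: hold out 10% of train as validation.
    - Otherwise: return the splits unchanged.
    """
    out = {name: list(items) for name, items in splits.items()}

    if len(out) == 1:
        # No split was provided: carve 80/10/10 out of the single pool.
        (examples,) = out.values()
        n = len(examples)
        n_train = n * 8 // 10  # integer arithmetic avoids float rounding
        n_val = n // 10
        return {
            "train": examples[:n_train],
            "validation": examples[n_train:n_train + n_val],
            "test": examples[n_train + n_val:],
        }

    if "train" in out and "validation" not in out:
        # Assumption: the last 10% of train becomes the validation split.
        train = out["train"]
        cut = len(train) - len(train) // 10
        out["train"], out["validation"] = train[:cut], train[cut:]

    return out
```

A real pipeline would likely shuffle with a fixed seed before slicing; that detail is not stated in the quoted text, so it is left out here.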