Generating Wikipedia by Summarizing Long Sequences

Authors: Peter J. Liu*, Mohammad Saleh*, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, Noam Shazeer

ICLR 2018

Reproducibility assessment: each entry below lists the reproducibility variable, the result, and the supporting LLM response quoted from the paper.
Research Type: Experimental
LLM Response: "In experiments we evaluate based on perplexity (per-wordpiece), a common language modeling metric, and ROUGE-L F1 (version ROUGE-1.5.5), a common metric used in summarization. As we see from Table 4, seq2seq-attention as a baseline does quite poorly on this task compared to the Transformer architectures."
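As a quick reference for the two metrics quoted above, the sketch below shows ROUGE-L F1 computed from a longest common subsequence and per-wordpiece perplexity from token log-probabilities. It assumes simple whitespace tokenization and a plain F1 combination, whereas the paper uses wordpieces and the official ROUGE-1.5.5 toolkit, so treat it only as an illustration of the definitions, not as the evaluation code.

```python
# Illustrative sketch of ROUGE-L F1 and per-wordpiece perplexity.
# Assumes whitespace tokens and plain F1; not the ROUGE-1.5.5 toolkit.
import math

def rouge_l_f1(candidate, reference):
    """F1 over the longest common subsequence (LCS) of tokens."""
    c, r = candidate.split(), reference.split()
    # Dynamic-programming table for LCS length.
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, ct in enumerate(c, 1):
        for j, rt in enumerate(r, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ct == rt else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[len(c)][len(r)]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities, one per wordpiece."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

print(rouge_l_f1("the cat sat on the mat", "the cat lay on the mat"))
print(perplexity([-1.2, -0.4, -2.3, -0.8]))
```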
Researcher Affiliation: Industry
LLM Response: "Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Łukasz Kaiser, Noam Shazeer. Google Brain, Mountain View, CA. {peterjliu,msaleh,epot,bgoodrich,rsepassi,lukaszkaiser,noam}@google.com"
Pseudocode: No
LLM Response: "No pseudocode or clearly labeled algorithm blocks are present in the paper."
Open Source Code: No
LLM Response: "We use the open-source tensor2tensor library for training abstractive models and will be releasing our abstractive modeling code extensions. Further details are available at https://goo.gl/wSuuS9."
Open Datasets: Yes
LLM Response: "We also provide code that extracts content from the Common Crawl dataset, which is freely available for download. To encourage further research on large-scale summarization, we will release the URLs used in our experiments (the Wikipedia URL as well as the URLs of its references)."
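The extraction code itself is not reproduced in this report; the following is only a minimal sketch of how released reference URLs might be matched against already-downloaded Common Crawl pages. The names crawled_pages, reference_urls, and extract_reference_text are hypothetical, and the authors' actual pipeline may differ substantially.

```python
# Minimal sketch, assuming `crawled_pages` maps URL -> raw HTML already
# pulled from Common Crawl and `reference_urls` is the released URL list.
# Names and flow are assumptions; this is not the authors' pipeline.
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects the visible text fragments of an HTML document."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def extract_reference_text(crawled_pages, reference_urls):
    """Return plain text for each released reference URL found in the crawl."""
    texts = {}
    for url in reference_urls:
        if url in crawled_pages:
            parser = _TextExtractor()
            parser.feed(crawled_pages[url])
            texts[url] = " ".join(parser.chunks)
    return texts
```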
Dataset Splits: Yes
LLM Response: "We divide the articles roughly into 80/10/10 for train/development/test subsets, resulting in 1865750, 233252, and 232998 examples respectively."
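For illustration, a minimal sketch of an 80/10/10 split over a list of examples is given below. The paper does not specify the exact shuffling or assignment procedure, so the seeded random shuffle and the function name split_examples are assumptions.

```python
# Minimal sketch of an 80/10/10 train/dev/test split; the shuffle and seed
# are assumptions, not the authors' procedure.
import random

def split_examples(examples, seed=0):
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train, n_dev = int(0.8 * n), int(0.1 * n)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test
```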
Hardware Specification: Yes
LLM Response: "Transformer-Decoder, which we found could learn and improve up to L = 4000, before running out of memory on our machines equipped with 16GB of GPU RAM (NVIDIA P100)."
Software Dependencies: No
LLM Response: "For all abstractive model training, we use the open-source tensor2tensor library."
Experiment Setup: Yes
LLM Response: "The seq2seq baseline had a hidden size of 128 with 2 layers (we use the hyper-parameter set defined in the library as lstm_attention). For the Transformer encoder-decoder (T-ED), we use the hyper-parameter set transformer_base_v1 and train for 1 million steps. Unless otherwise stated, during decoding we use a beam search of size 4 and length penalty α = 0.6 (Wu et al., 2016) and decode until an end-of-sequence token is reached."
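The length penalty α = 0.6 refers to the formula of Wu et al. (2016), lp(Y) = ((5 + |Y|) / 6)^α, with beam candidates ranked by log P(Y|X) / lp(Y). The sketch below illustrates this scoring; the function names and the example numbers are illustrative only.

```python
# Illustrative sketch of the Wu et al. (2016) length penalty used to
# rank beam-search candidates; function names are assumptions.
def length_penalty(length, alpha=0.6):
    """lp(Y) = ((5 + |Y|) / 6) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha

def beam_score(log_prob, length, alpha=0.6):
    """Length-normalized log-probability of a finished candidate."""
    return log_prob / length_penalty(length, alpha)

# A longer candidate with a lower raw log-probability can still rank higher
# once its score is normalized by the length penalty.
print(beam_score(-12.0, 20), beam_score(-10.5, 12))
```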