Generating Wikipedia by Summarizing Long Sequences
Authors: Peter J. Liu*, Mohammad Saleh*, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, Noam Shazeer
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments we evaluate based on perplexity (per-wordpiece), a common language modeling metric, and ROUGE-L F1 (version ROUGE-1.5.5), a common metric used in summarization. As we see from Table 4, seq2seq-attention as a baseline does quite poorly on this task compared to the Transformer architectures. (A metric sketch follows the table.) |
| Researcher Affiliation | Industry | Peter J. Liu , Mohammad Saleh , Etienne Pot , Ben Goodrich, Ryan Sepassi, Łukasz Kaiser, Noam Shazeer Google Brain Mountain View, CA {peterjliu,msaleh,epot,bgoodrich,rsepassi,lukaszkaiser,noam}@google.com |
| Pseudocode | No | No pseudocode or clearly labeled algorithm blocks are present in the paper. |
| Open Source Code | No | We use the open-source tensor2tensor library for training abstractive models and will be releasing our abstractive modeling code extensions. Further details are available at https://goo.gl/wSuuS9. |
| Open Datasets | Yes | We also provide code that extracts content from the Common Crawl dataset, which is freely available for download. To encourage further research on large-scale summarization, we will release the URLs used in our experiments (the Wikipedia URL as well as the URLs of its references). |
| Dataset Splits | Yes | We divide the articles roughly into 80/10/10 for train/development/test subsets, resulting in 1865750, 233252, and 232998 examples respectively. (A split sketch follows the table.) |
| Hardware Specification | Yes | Transformer-Decoder, which we found could learn and improve up to L = 4000, before running out of memory on our machines equipped with 16GB of GPU RAM (NVIDIA P100). |
| Software Dependencies | No | For all abstractive model training, we use the open-source tensor2tensor library. |
| Experiment Setup | Yes | The seq2seq baseline had a hidden size of 128 with 2 layers (we use the hyper-parameter set defined in the library as lstm_attention). For the Transformer encoder-decoder (T-ED), we use the hyper-parameter set transformer_base_v1 and train for 1 million steps. Unless otherwise stated, during decoding we use a beam search of size 4 and length penalty α = 0.6 (Wu et al., 2016) and decode until an end-of-sequence token is reached. (A decoding sketch follows the table.) |
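
The Research Type row cites two metrics: perplexity per wordpiece and ROUGE-L F1 (computed in the paper with the ROUGE-1.5.5 toolkit). The sketch below is a rough stand-in, not the authors' evaluation pipeline: it uses the `rouge-score` Python package instead of the original Perl script, and the toy strings and natural-log perplexity convention are assumptions.

```python
import math
from rouge_score import rouge_scorer  # pip install rouge-score

def perplexity_per_wordpiece(total_neg_log_likelihood: float, num_wordpieces: int) -> float:
    """Perplexity as the exponentiated mean negative log-likelihood per
    wordpiece (natural log assumed; the paper does not spell out the base)."""
    return math.exp(total_neg_log_likelihood / num_wordpieces)

# ROUGE-L F1 between a model summary and the reference Wikipedia lead.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
scores = scorer.score(
    target="the cathedral was built in the twelfth century",
    prediction="the cathedral was constructed in the 12th century",
)
print(scores["rougeL"].fmeasure)  # the F1 variant, as reported in the paper
```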
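The Dataset Splits row only states the rough 80/10/10 proportions (1,865,750 / 233,252 / 232,998 examples). One way to reproduce such a split deterministically is to bucket each article by a stable hash of its URL; the hashing mechanism below is an assumption for illustration, not the paper's procedure.

```python
import hashlib

def split_for(url: str) -> str:
    """Assign an article to train/dev/test (~80/10/10) by hashing its URL.

    Hash-based bucketing is an assumed mechanism; the paper only reports
    the rough 80/10/10 proportions.
    """
    bucket = int(hashlib.md5(url.encode("utf-8")).hexdigest(), 16) % 10
    if bucket < 8:
        return "train"
    return "dev" if bucket == 8 else "test"

print(split_for("https://en.wikipedia.org/wiki/Deep_learning"))
```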
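The Experiment Setup row decodes with beam size 4 and length penalty α = 0.6, following Wu et al. (2016), where finished beams are ranked by their log-probability divided by a length penalty. The sketch below restates that rescoring rule from Wu et al. (2016); it is not code from the paper, and the example numbers are made up.

```python
def length_penalty(length: int, alpha: float = 0.6) -> float:
    """GNMT length penalty lp(Y) = ((5 + |Y|) / 6) ** alpha (Wu et al., 2016)."""
    return ((5.0 + length) / 6.0) ** alpha

def beam_score(log_prob: float, length: int, alpha: float = 0.6) -> float:
    """Score used to rank finished beam hypotheses: log P(Y|X) / lp(Y)."""
    return log_prob / length_penalty(length, alpha)

# A longer hypothesis with slightly lower total log-probability can still
# rank higher because the penalty normalizes for length.
print(beam_score(-12.0, 20), beam_score(-11.5, 15))
```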