Generating Wikipedia by Summarizing Long Sequences
Authors: Peter J. Liu*, Mohammad Saleh*, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, Noam Shazeer
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments we evaluate based on perplexity (per-wordpiece), a common language modeling metric, and ROUGE-L F1 (version ROUGE-1.5.5), a common metric used in summarization. As we see from Table 4, seq2seq-attention as a baseline does quite poorly on this task compared to the Transformer architectures. (A metric sketch follows the table.) |
| Researcher Affiliation | Industry | Peter J. Liu , Mohammad Saleh , Etienne Pot , Ben Goodrich, Ryan Sepassi, Łukasz Kaiser, Noam Shazeer Google Brain Mountain View, CA {peterjliu,msaleh,epot,bgoodrich,rsepassi,lukaszkaiser,noam}@google.com |
| Pseudocode | No | No pseudocode or clearly labeled algorithm blocks are present in the paper. |
| Open Source Code | No | We use the open-source tensor2tensor library for training abstractive models and will be releasing our abstractive modeling code extensions. Further details are available at https://goo.gl/wSuuS9. |
| Open Datasets | Yes | We also provide code that extracts content from the Common Crawl dataset, which is freely available for download. To encourage further research on large-scale summarization, we will release the URLs used in our experiments (the Wikipedia URL as well as the URLs of its references). |
| Dataset Splits | Yes | We divide the articles roughly into 80/10/10 for train/development/test subsets, resulting in 1865750, 233252, and 232998 examples respectively. (A split sketch follows the table.) |
| Hardware Specification | Yes | Transformer-Decoder, which we found could learn and improve up to L = 4000, before running out of memory on our machines equipped with 16GB of GPU RAM (NVIDIA P100). |
| Software Dependencies | No | For all abstractive model training, we use the open-source tensor2tensor library. |
| Experiment Setup | Yes | The seq2seq baseline had a hidden size of 128 with 2 layers (we use the hyper-parameter set defined in the library as lstm_attention). For the Transformer encoder-decoder (T-ED), we use the hyper-parameter set transformer_base_v1 and train for 1 million steps. Unless otherwise stated, during decoding we use a beam search of size 4 and length penalty α = 0.6 (Wu et al., 2016) and decode until an end-of-sequence token is reached. (A decoding sketch follows the table.) |
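
The Research Type row cites two metrics: perplexity per wordpiece and ROUGE-L F1 (computed in the paper with the ROUGE-1.5.5 toolkit). The sketch below is a rough stand-in, not the authors' evaluation pipeline: it uses the `rouge-score` Python package instead of the original Perl script, and the toy strings and natural-log perplexity convention are assumptions.

```python
import math
from rouge_score import rouge_scorer  # pip install rouge-score

def perplexity_per_wordpiece(total_neg_log_likelihood: float, num_wordpieces: int) -> float:
    """Perplexity as the exponentiated mean negative log-likelihood per
    wordpiece (natural log assumed; the paper does not spell out the base)."""
    return math.exp(total_neg_log_likelihood / num_wordpieces)

# ROUGE-L F1 between a model summary and the reference Wikipedia lead.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
scores = scorer.score(
    target="the cathedral was built in the twelfth century",
    prediction="the cathedral was constructed in the 12th century",
)
print(scores["rougeL"].fmeasure)  # the F1 variant, as reported in the paper
```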
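The Dataset Splits row only states the rough 80/10/10 proportions (1,865,750 / 233,252 / 232,998 examples). One way to reproduce such a split deterministically is to bucket each article by a stable hash of its URL; the hashing mechanism below is an assumption for illustration, not the paper's procedure.

```python
import hashlib

def split_for(url: str) -> str:
    """Assign an article to train/dev/test (~80/10/10) by hashing its URL.

    Hash-based bucketing is an assumed mechanism; the paper only reports
    the rough 80/10/10 proportions.
    """
    bucket = int(hashlib.md5(url.encode("utf-8")).hexdigest(), 16) % 10
    if bucket < 8:
        return "train"
    return "dev" if bucket == 8 else "test"

print(split_for("https://en.wikipedia.org/wiki/Deep_learning"))
```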
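The Experiment Setup row decodes with beam size 4 and length penalty α = 0.6, following Wu et al. (2016), where finished beams are ranked by their log-probability divided by a length penalty. The sketch below restates that rescoring rule from Wu et al. (2016); it is not code from the paper, and the example numbers are made up.

```python
def length_penalty(length: int, alpha: float = 0.6) -> float:
    """GNMT length penalty lp(Y) = ((5 + |Y|) / 6) ** alpha (Wu et al., 2016)."""
    return ((5.0 + length) / 6.0) ** alpha

def beam_score(log_prob: float, length: int, alpha: float = 0.6) -> float:
    """Score used to rank finished beam hypotheses: log P(Y|X) / lp(Y)."""
    return log_prob / length_penalty(length, alpha)

# A longer hypothesis with slightly lower total log-probability can still
# rank higher because the penalty normalizes for length.
print(beam_score(-12.0, 20), beam_score(-11.5, 15))
```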