The Curious Case of Neural Text Degeneration

Authors: Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, Yejin Choi

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To properly examine current maximization-based and stochastic decoding methods, we compare generations from each of these methods to the distribution of human text along several axes such as likelihood, diversity, and repetition. Our results show that (1) maximization is an inappropriate decoding objective for open-ended text generation, (2) the probability distributions of the best current language models have an unreliable tail which needs to be truncated during generation and (3) Nucleus Sampling is currently the best available decoding strategy for generating long-form text that is both high-quality as measured by human evaluation and as diverse as human-written text. In this study we use the Generatively Pre-trained Transformer, version 2 (GPT2; Radford et al., 2019), which was trained on WebText, a 40GB collection of text scraped from the web. We perform experiments using the Large model (762M parameters). Our analysis is based on generating 5,000 text passages... (A Nucleus Sampling sketch is given in code after the table.)
Researcher Affiliation | Collaboration | Paul G. Allen School of Computer Science & Engineering, University of Washington; Allen Institute for Artificial Intelligence; Department of Computer Science, University of Cape Town
Pseudocode | No | The paper describes its algorithms using mathematical formulations and textual descriptions, but no explicit 'Pseudocode' or 'Algorithm' block was found.
Open Source Code | Yes | Code and all generations are available at https://github.com/ari-holtzman/degen
Open Datasets | Yes | In this study we use the Generatively Pre-trained Transformer, version 2 (GPT2; Radford et al., 2019), which was trained on WebText, a 40GB collection of text scraped from the web. Available at https://github.com/openai/gpt-2-output-dataset
Dataset Splits | No | The paper states that texts are generated conditionally, conditioned on the initial paragraph of documents in the 'held-out portion of WebText', but it does not specify exact percentages or counts for training, validation, or test splits in its experiments.
Hardware Specification | No | The paper mentions using the GPT-2 Large model but does not provide specific details about the hardware (e.g., GPU models, CPU types, or cloud instances) used to run the experiments.
Software Dependencies | No | The paper does not provide specific version numbers for the software dependencies or libraries used in its experiments.
Experiment Setup | Yes | We perform experiments using the Large model (762M parameters). Our analysis is based on generating 5,000 text passages, which end upon reaching an end-of-document token or a maximum length of 200 tokens. Texts are generated conditionally, conditioned on the initial paragraph (restricted to 1-40 tokens) of documents in the held-out portion of WebText, except where otherwise mentioned. (Also, implicitly from Table 1 and Sections 3 and 4: beam width b = 16, sampling temperature t = 0.9, top-k with k = 40 and k = 640, and Nucleus Sampling with p = 0.95 are specified as parameters.) (A generation-setup sketch using these settings follows the table.)
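The paper defines Nucleus (top-p) Sampling as keeping the smallest set of tokens whose cumulative probability exceeds a threshold p (0.95 in the experiments) and renormalizing before sampling, which is how the "unreliable tail" mentioned in the abstract gets truncated. The snippet below is a minimal NumPy sketch of that truncation step, not the authors' released implementation (which lives in the linked repository); the function name and interface are illustrative.

```python
import numpy as np

def nucleus_sample(probs, p=0.95, rng=None):
    """Sample a token id from a next-token distribution with Nucleus (top-p) Sampling.

    `probs` is a 1-D array of probabilities over the vocabulary. The smallest
    set of highest-probability tokens whose cumulative mass reaches p is kept;
    the remaining tail is discarded and the kept mass is renormalized.
    """
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]               # token ids, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1   # smallest prefix with mass >= p
    nucleus_ids = order[:cutoff]
    nucleus_probs = probs[nucleus_ids] / probs[nucleus_ids].sum()
    return int(rng.choice(nucleus_ids, p=nucleus_probs))
```

With p = 0.95, high-entropy distributions keep many candidate tokens while peaked distributions keep only a few, which is the adaptive behavior the paper contrasts with fixed top-k truncation.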
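The paper does not tie its experiments to a particular library or API; the sketch below reproduces the reported configuration (GPT-2 Large, conditioning on a 1-40 token prefix, a 200-token cap, Nucleus Sampling with p = 0.95) using the current Hugging Face transformers generate API as an assumed stand-in, so the exact calls and argument names are not the authors' setup.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
model = GPT2LMHeadModel.from_pretrained("gpt2-large")
model.eval()

# Condition on the first paragraph of a held-out document, truncated to at most 40 tokens.
prompt = "An example conditioning paragraph from a held-out WebText document."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids[:, :40]

with torch.no_grad():
    output = model.generate(
        input_ids,
        do_sample=True,
        top_p=0.95,           # Nucleus Sampling threshold reported in the paper
        top_k=0,              # disable top-k so only the nucleus truncation applies
        max_new_tokens=200,   # the paper caps generations at 200 tokens
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Swapping in num_beams=16 (with do_sample=False), temperature=0.9, or top_k=40 / top_k=640 would correspond to the other decoding settings listed in the Experiment Setup row; whether the 200-token limit includes the prompt is not stated in the excerpt, so max_new_tokens is an assumption here.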