MIROSTAT: A NEURAL TEXT DECODING ALGORITHM THAT DIRECTLY CONTROLS PERPLEXITY

Authors: Sourya Basu, Govardana Sachitanandam Ramachandran, Nitish Shirish Keskar, Lav R. Varshney

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that for low values of k and p, perplexity drops significantly with generated text length and leads to excessive repetitions (the boredom trap). Contrarily, for large values of k and p, perplexity increases with generated text length and leads to incoherence (confusion trap). Mirostat avoids both traps. Specifically, we show that setting target perplexity value beyond a threshold yields negligible sentence-level repetitions. Experiments with human raters for fluency, coherence, and quality further verify our findings.
Researcher Affiliation | Collaboration | Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign; Salesforce Research
Pseudocode | Yes | Algorithm 1: Mirostat sampling for perplexity control
Open Source Code | Yes | Code is available at https://github.com/basusourya/mirostat
Open Datasets | Yes | We use the GPT-2 LM with 117M parameters for all experiments (Radford et al., 2019) unless mentioned otherwise, and just refer to it as GPT-2.
Dataset Splits | No | The paper does not specify train/validation/test splits for the data used in their experiments, as they are primarily evaluating a text decoding algorithm on a pre-trained language model rather than training a new model.
Hardware Specification | No | The paper does not provide specific details on the hardware used to run the experiments (e.g., GPU/CPU models, memory, or cloud instance types).
Software Dependencies | No | The paper mentions using the GPT-2 LM but does not list any specific software dependencies or their version numbers (e.g., Python, PyTorch, TensorFlow, CUDA versions).
Experiment Setup | Yes | Alg. 1 details mirostat, which generates text with a predetermined average surprise value. The input is a target surprise value τ, which in turn initializes a variable µ = 2τ. Each word is sampled by first estimating s from (30) as ŝ, then using top-k sampling where k is a function of the estimated ŝ and of the target surprise value of the output text. [...] Compute error: e = S(X) − τ. Update µ: µ = µ − ηe. Also, for human evaluations: 'We generated 300 tokens using GPT-2 from a fixed context with average cross-entropy rate τ ∈ {2.5, 3, 4, 5} using both mirostat and top-p sampling.'
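
To make the feedback loop in Alg. 1 concrete, below is a minimal Python sketch of mirostat sampling on top of a Hugging Face transformers GPT-2 model. This is an illustrative reconstruction, not the authors' released code: the helper names (estimate_s, compute_k), the 100 ranked tokens used to estimate ŝ, the numerical guard on ŝ − 1, and the defaults τ = 3.0 and η = 0.1 are assumptions made for this sketch; see https://github.com/basusourya/mirostat for the reference implementation.

```python
# Illustrative sketch of mirostat sampling (Alg. 1): adaptive top-k decoding
# that steers the observed surprise S(X) of each sampled token toward a target tau.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

def estimate_s(probs, m=100):
    # Estimate the Zipf exponent s from the top-m sorted probabilities,
    # following the least-squares form of Eq. (30); m=100 is an assumption here.
    num, den = 0.0, 0.0
    for i in range(min(m, len(probs) - 1)):
        b = math.log((i + 2) / (i + 1))
        t = math.log(probs[i] / probs[i + 1])
        num += t * b
        den += b * b
    return num / den

def compute_k(n_vocab, s_hat, mu):
    # Choose k so that, under a Zipf(s_hat) fit of the distribution, the expected
    # surprise of top-k sampling is roughly mu bits.
    eps = max(s_hat - 1.0, 1e-3)  # small numerical guard, not part of the paper
    k = ((eps * (2 ** mu)) / (1 - n_vocab ** (-eps))) ** (1.0 / s_hat)
    return max(1, round(k))

@torch.no_grad()
def mirostat_generate(model, tokenizer, context, tau=3.0, eta=0.1, max_new_tokens=100):
    mu = 2.0 * tau                                   # maximum surprise, initialized to 2*tau
    ids = tokenizer.encode(context, return_tensors="pt")
    n_vocab = model.config.vocab_size
    for _ in range(max_new_tokens):
        logits = model(ids).logits[0, -1, :]
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        probs = torch.softmax(sorted_logits, dim=-1)
        s_hat = estimate_s(probs.tolist())
        k = compute_k(n_vocab, s_hat, mu)            # k depends on the estimated s and mu
        topk_probs = torch.softmax(sorted_logits[:k], dim=-1)
        choice = torch.multinomial(topk_probs, num_samples=1)
        surprise = -math.log2(probs[choice].item())  # observed surprise S(X) in bits
        mu -= eta * (surprise - tau)                 # e = S(X) - tau; mu = mu - eta*e
        ids = torch.cat([ids, sorted_idx[choice].view(1, 1)], dim=1)
    return tokenizer.decode(ids[0])

# Example usage with the 117M-parameter GPT-2 used in the paper:
# model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
# tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# print(mirostat_generate(model, tokenizer, "The night sky", tau=3.0))
```

Because µ is nudged after every token by the feedback term −η(S(X) − τ), the running average surprise (and hence the perplexity) of the generated text is pulled toward the target τ regardless of text length, which is what lets mirostat avoid both the boredom and confusion traps described above.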