Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

MIROSTAT: A NEURAL TEXT DECODING ALGORITHM THAT DIRECTLY CONTROLS PERPLEXITY

Authors: Sourya Basu, Govardana Sachitanandam Ramachandran, Nitish Shirish Keskar, Lav R. Varshney

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that for low values of k and p, perplexity drops significantly with generated text length and leads to excessive repetitions (the boredom trap). Contrarily, for large values of k and p, perplexity increases with generated text length and leads to incoherence (confusion trap). Mirostat avoids both traps. Specifically, we show that setting target perplexity value beyond a threshold yields negligible sentence-level repetitions. Experiments with human raters for fluency, coherence, and quality further verify our findings.
Researcher Affiliation | Collaboration | Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign; Salesforce Research
Pseudocode | Yes | Algorithm 1: Mirostat sampling for perplexity control
Open Source Code | Yes | Code is available at https://github.com/basusourya/mirostat
Open Datasets | Yes | We use the GPT-2 LM with 117M parameters for all experiments (Radford et al., 2019) unless mentioned otherwise, and just refer to it as GPT-2.
Dataset Splits | No | The paper does not specify train/validation/test splits for the data used in their experiments, as they are primarily evaluating a text decoding algorithm on a pre-trained language model rather than training a new model.
Hardware Specification | No | The paper does not provide specific details on the hardware used to run the experiments (e.g., GPU/CPU models, memory, or cloud instance types).
Software Dependencies | No | The paper mentions using the GPT-2 LM but does not list any specific software dependencies or their version numbers (e.g., Python, PyTorch, TensorFlow, CUDA versions).
Experiment Setup | Yes | Alg. 1 details mirostat, which generates texts with predetermined average surprise value. The input is a target surprise value τ, which in turn initializes a variable µ = 2τ. Each word is sampled by first estimating s from (30) as ŝ, then using top-k sampling where k is a function of the estimated s and of the target surprise value of the output text. [...] Compute error: e = S(X) - τ. Update µ: µ = µ - ηe. Also, for human evaluations: 'We generated 300 tokens using GPT-2 from a fixed context with average cross-entropy rate τ ∈ {2.5, 3, 4, 5} using both mirostat and top-p sampling.'
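
The feedback loop summarized in the Experiment Setup row (estimate s, pick k, sample with top-k, then update µ from the error e = S(X) - τ) can be sketched in plain Python as below. This is an illustrative sketch, not the authors' released code. The function names, the guards for degenerate fits, the rounding of k, the learning rate 0.1, and the fixed toy Zipf distribution that stands in for GPT-2 are all assumptions made here. The formulas for ŝ and k follow our reading of the paper's Algorithm 1, with surprise measured in bits.

```python
import math
import random


def estimate_s(sorted_probs, m=100):
    """Least-squares estimate of the Zipf exponent s from the top-m probabilities."""
    num = den = 0.0
    for i in range(min(m, len(sorted_probs)) - 1):
        if sorted_probs[i + 1] <= 0.0:
            break
        t = math.log((i + 2) / (i + 1))
        b = math.log(sorted_probs[i] / sorted_probs[i + 1])
        num += t * b
        den += t * t
    return num / den if den > 0.0 else 0.0


def compute_k(vocab_size, s_hat, mu):
    """Top-k cutoff implied by the Zipf fit and the current value of mu."""
    eps = s_hat - 1.0
    # Guards for degenerate fits are this sketch's own choice, not the paper's.
    if vocab_size < 2 or s_hat <= 1e-6 or abs(eps) < 1e-9:
        return vocab_size
    try:
        k = ((eps * 2.0 ** mu) / (1.0 - vocab_size ** (-eps))) ** (1.0 / s_hat)
    except OverflowError:
        return vocab_size
    return max(1, min(vocab_size, round(k)))


def mirostat_step(probs, mu, tau, eta, rng):
    """One decoding step: returns (sampled token index, updated mu)."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    sorted_probs = [probs[i] for i in order]
    s_hat = estimate_s(sorted_probs)
    k = compute_k(len(probs), s_hat, mu)
    # Top-k sampling: random.choices renormalizes the truncated weights.
    choice = rng.choices(range(k), weights=sorted_probs[:k], k=1)[0]
    # Observed surprise S(X) in bits, under the untruncated distribution.
    surprise = -math.log2(sorted_probs[choice])
    error = surprise - tau          # e = S(X) - tau
    return order[choice], mu - eta * error  # mu = mu - eta * e


def generate_toy(tau=3.0, steps=1000, eta=0.1, vocab_size=5000, seed=0):
    """Run the loop on a fixed Zipf(1.1) distribution (a stand-in for an LM)."""
    weights = [(i + 1) ** -1.1 for i in range(vocab_size)]
    total = sum(weights)
    probs = [w / total for w in weights]
    rng = random.Random(seed)
    mu = 2.0 * tau                  # initialization mu = 2 * tau
    surprise_sum = 0.0
    for _ in range(steps):
        token, mu = mirostat_step(probs, mu, tau, eta, rng)
        surprise_sum += -math.log2(probs[token])
    return surprise_sum / steps, mu


if __name__ == "__main__":
    avg, mu_final = generate_toy()
    print(f"average surprise: {avg:.3f} bits, final mu: {mu_final:.3f}")
```

Because each update subtracts η times the error, the summed errors over a run equal (µ_initial - µ_final)/η, so the average surprise stays close to τ as long as µ remains bounded. In a real decoder, `probs` would be recomputed from the language model at every step rather than held fixed.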