Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..

Just Wing It: Near-Optimal Estimation of Missing Mass in a Markovian Sequence

Authors: Ashwin Pananjady, Vidya Muthukumar, Andrew Thangaraj

JMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Finally, we demonstrate the eﬃcacy of our estimators both in simulations on canonical chains and on sequences constructed from natural language text.
Researcher Affiliation	Academia	Ashwin Pananjady EMAIL Schools of Industrial and Systems Engineering and Electrical and Computer Engineering Georgia Institute of Technology Atlanta, USA Vidya Muthukumar EMAIL Schools of Electrical and Computer Engineering and Industrial and Systems Engineering Georgia Institute of Technology Atlanta, USA Andrew Thangaraj EMAIL Department of Electrical Engineering Indian Institute of Technology Madras Chennai, India
Pseudocode	Yes	Algorithm 1 Linear time implementation of the Wing It estimator. Require: Sequence Xn = X1, . . . , Xn, Natural number τ 1: Initialize locations as a dictionary 2: for i = 1, . . . , n do 3: if Xi / locations then 4: Initialize locations[Xi] as a list 6: Append i to locations[Xi] 9: for i = 1, . . . , n do 10: if locations[Xi].ﬁrst > i τ and locations[Xi].last < i + τ then 11: c M c M + 1/n 13: end for 14: return c M
Open Source Code	Yes	Code and the text used for the simulations are available at Thangaraj et al. (2024). Andrew Thangaraj, Ashwin Pananjady, and Vidya Muthukumar. Missing Mass of Markov Chains, 2024. URL https://github.com/andrewthan/Missing-Mass.
Open Datasets	Yes	In our ﬁnal experiment, presented in Figure 4, we consider the text of the novel A Tale of Two Cities by Charles Dickens accessed through Project Gutenberg (Dickens, 1994). Charles Dickens. A Tale of Two Cities. Project Gutenberg, Urbana, IL, USA, 1994. URL gutenberg.org/ebooks/98.
Dataset Splits	Yes	We ﬁrst split the sequence into the ﬁrst one-third Z(1) = (X1, . . . , Xn/3) and the ﬁnal onethird Z(2) = (X2n/3+1, . . . , Xn) and compute the random variable f M(Z(1), Z(2)) = 1 (n/3) Pn i=2n/3+1 I Xi / {X1, . . . , Xn/3} . For a given length n, approximately 15N/n trajectories were considered in the simulations with their starting points separated by n/15.
Hardware Specification	No	The paper does not explicitly describe the hardware used for running its experiments.
Software Dependencies	No	The paper mentions a "Python implementation" for Algorithm 1, but does not provide specific version numbers for Python or any other software libraries or dependencies.
Experiment Setup	Yes	We ﬁrst split the sequence into the ﬁrst one-third Z(1) = (X1, . . . , Xn/3) and the ﬁnal onethird Z(2) = (X2n/3+1, . . . , Xn) and compute the random variable f M(Z(1), Z(2)) = 1 (n/3) Pn i=2n/3+1 I Xi / {X1, . . . , Xn/3} . Next, iterate τ = 1, 2, 4, . . . , 2 log2(n/6) in increasing order, and compute c MWing It(τ) on the sequence Z(1); denote this random variable by c MWing It(Z(1); τ) for convenience. We then set bτ to be the smallest τ among this set such that c MWing It(Z(1); τ) f M(Z(1), Z(2) 2 Ctune τ for a suitable choice of the constant Ctune > 0. If such an inequality is not satisﬁed for any τ in the prescribed list, then we set bτ = n/6. ... with the constant Ctune in Eq. (28) set to be 1. We simulate the sticky Markov chain (10) with p = 0.5... The best (power of 2) window sizes for p = 0.1, 0.5, 0.9 are observed through simulations to be, respectively, τ = 4, 16, 64.