Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Just Wing It: Near-Optimal Estimation of Missing Mass in a Markovian Sequence
Authors: Ashwin Pananjady, Vidya Muthukumar, Andrew Thangaraj
JMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we demonstrate the efficacy of our estimators both in simulations on canonical chains and on sequences constructed from natural language text. |
| Researcher Affiliation | Academia | Ashwin Pananjady EMAIL Schools of Industrial and Systems Engineering and Electrical and Computer Engineering Georgia Institute of Technology Atlanta, USA Vidya Muthukumar EMAIL Schools of Electrical and Computer Engineering and Industrial and Systems Engineering Georgia Institute of Technology Atlanta, USA Andrew Thangaraj EMAIL Department of Electrical Engineering Indian Institute of Technology Madras Chennai, India |
| Pseudocode | Yes | Algorithm 1 Linear time implementation of the Wing It estimator. Require: Sequence Xn = X1, . . . , Xn, Natural number τ 1: Initialize locations as a dictionary 2: for i = 1, . . . , n do 3: if Xi / locations then 4: Initialize locations[Xi] as a list 6: Append i to locations[Xi] 9: for i = 1, . . . , n do 10: if locations[Xi].first > i τ and locations[Xi].last < i + τ then 11: c M c M + 1/n 13: end for 14: return c M |
| Open Source Code | Yes | Code and the text used for the simulations are available at Thangaraj et al. (2024). Andrew Thangaraj, Ashwin Pananjady, and Vidya Muthukumar. Missing Mass of Markov Chains, 2024. URL https://github.com/andrewthan/Missing-Mass. |
| Open Datasets | Yes | In our final experiment, presented in Figure 4, we consider the text of the novel A Tale of Two Cities by Charles Dickens accessed through Project Gutenberg (Dickens, 1994). Charles Dickens. A Tale of Two Cities. Project Gutenberg, Urbana, IL, USA, 1994. URL gutenberg.org/ebooks/98. |
| Dataset Splits | Yes | We first split the sequence into the first one-third Z(1) = (X1, . . . , Xn/3) and the final onethird Z(2) = (X2n/3+1, . . . , Xn) and compute the random variable f M(Z(1), Z(2)) = 1 (n/3) Pn i=2n/3+1 I Xi / {X1, . . . , Xn/3} . For a given length n, approximately 15N/n trajectories were considered in the simulations with their starting points separated by n/15. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for running its experiments. |
| Software Dependencies | No | The paper mentions a "Python implementation" for Algorithm 1, but does not provide specific version numbers for Python or any other software libraries or dependencies. |
| Experiment Setup | Yes | We first split the sequence into the first one-third Z(1) = (X1, . . . , Xn/3) and the final onethird Z(2) = (X2n/3+1, . . . , Xn) and compute the random variable f M(Z(1), Z(2)) = 1 (n/3) Pn i=2n/3+1 I Xi / {X1, . . . , Xn/3} . Next, iterate τ = 1, 2, 4, . . . , 2 log2(n/6) in increasing order, and compute c MWing It(τ) on the sequence Z(1); denote this random variable by c MWing It(Z(1); τ) for convenience. We then set bτ to be the smallest τ among this set such that c MWing It(Z(1); τ) f M(Z(1), Z(2) 2 Ctune τ for a suitable choice of the constant Ctune > 0. If such an inequality is not satisfied for any τ in the prescribed list, then we set bτ = n/6. ... with the constant Ctune in Eq. (28) set to be 1. We simulate the sticky Markov chain (10) with p = 0.5... The best (power of 2) window sizes for p = 0.1, 0.5, 0.9 are observed through simulations to be, respectively, τ = 4, 16, 64. |