Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Convergence Guarantees for the Good-Turing Estimator
Authors: Amichai Painsky
JMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | An extensive empirical study which demonstrates the performance of the proposed estimator, compared to currently known schemes. Finally, in Section 8 we compare our suggested framework with currently known estimators in a series of synthetic and real-world experiments. |
| Researcher Affiliation | Academia | Amichai Painsky, Department of Industrial Engineering, Tel Aviv University, Tel Aviv, Israel |
| Pseudocode | No | The paper focuses on mathematical derivations, theorems, and proofs related to the Good-Turing estimator. It does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code, nor does it provide links to any code repositories. |
| Open Datasets | Yes | We begin with a corpus linguistic experiment. The popular Broadway play Hamilton consists of 20,520 words, of which m = 3,578 are distinct. Gao et al. (2007) considered the forearm skin biota of six subjects. Finally, we study census data. The lower row of Figure 5 considers the 2000 United States Census (Bureau, 2014), which lists the frequency of the top m = 1000 most common last names in the United States. |
| Dataset Splits | No | In each experiment we draw n samples, and compare the occupancy probabilities Mk(Xn) with their corresponding estimators, for different values of k. To attain an averaged error, we repeat each experiment 1000 times, and average the squared error. The paper describes a sampling and resampling evaluation methodology rather than traditional dataset splits for training, validation, and testing. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as GPU/CPU models or other computer specifications. |
| Software Dependencies | No | The paper does not mention any specific software or library names along with their version numbers that would be necessary to replicate the experiments. |
| Experiment Setup | No | The paper describes the mathematical formulations of the estimators and analyzes their convergence rates. While it discusses sample sizes (n) and k values for evaluation, it does not specify hyperparameters, training configurations, or system-level settings typically found in experimental setups for machine learning models. |
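The evaluation protocol quoted in the Dataset Splits row (draw n samples, compare the occupancy probabilities Mk(Xn) with their estimators, repeat 1000 times, average the squared error) can be sketched as follows. This is a minimal illustration, not the paper's code: the Zipf-like source distribution, the sample size, and the helper names (`good_turing_mass`, `true_mass`, `averaged_squared_error`) are assumptions chosen for the example, and the estimator shown is the classical Good-Turing formula M_k ≈ (k+1)·N_{k+1}/n.

```python
import random
from collections import Counter

def good_turing_mass(counts, n, k):
    """Classical Good-Turing estimate of M_k, the total probability mass
    of symbols appearing exactly k times in a sample of size n:
    M_k ~ (k+1) * N_{k+1} / n, where N_j counts the distinct symbols
    seen exactly j times."""
    freq_of_freqs = Counter(counts.values())
    return (k + 1) * freq_of_freqs.get(k + 1, 0) / n

def true_mass(probs, counts, k):
    """True occupancy probability M_k(X^n): summed probability of the
    symbols that appear exactly k times in the sample (k = 0 gives the
    unseen, or "missing", mass)."""
    return sum(p for sym, p in probs.items() if counts.get(sym, 0) == k)

def averaged_squared_error(probs, n, k, trials=1000, seed=0):
    """Repeat the experiment `trials` times and average the squared
    error between the Good-Turing estimate and the true M_k, mirroring
    the 1000-repetition averaging described in the paper."""
    rng = random.Random(seed)
    symbols = list(probs)
    weights = [probs[s] for s in symbols]
    err = 0.0
    for _ in range(trials):
        sample = rng.choices(symbols, weights=weights, k=n)
        counts = Counter(sample)
        err += (good_turing_mass(counts, n, k) - true_mass(probs, counts, k)) ** 2
    return err / trials

# Illustrative source: a small Zipf-like distribution over 50 symbols
# (an assumption for this sketch, not one of the paper's datasets).
m = 50
z = sum(1 / i for i in range(1, m + 1))
probs = {i: (1 / i) / z for i in range(1, m + 1)}
mse_unseen = averaged_squared_error(probs, n=200, k=0)
```

The k = 0 case estimates the missing mass, i.e. the total probability of symbols never observed in the sample; larger k values evaluate the estimator on increasingly frequent symbols, as in the paper's comparison across different values of k.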