Data Amplification: Instance-Optimal Property Estimation

Authors: Yi Hao, Alon Orlitsky

ICML 2020

Reproducibility Assessment (Variable: Result, with LLM response)
Research Type: Experimental. We demonstrate the efficacy of the proposed estimators by comparing their performance to two state-of-the-art estimators (Wu & Yang, 2016; 2019), and to empirical estimators with logarithmically larger sample sizes. Due to method similarity, we present only the results for entropy and support size. Additional estimators for both properties were compared in Orlitsky et al. (2016), Wu & Yang (2016; 2019), Hao et al. (2018), and Hao & Orlitsky (2019a) and found to perform similarly to or worse than the estimators we tested, hence we exclude them here. For each property, we considered nine natural-synthetic distributions, shown in Figures 1 and 2. Settings: We experimented with nine distributions having support size S = 10,000: the uniform distribution; a two-step distribution with probability values 0.5S^-1 and 1.5S^-1; the Zipf distribution with power 1/2; the Zipf distribution with power 1; the binomial distribution with success probability 0.3; the geometric distribution with success probability 0.9; the Poisson distribution with mean 0.3S; a distribution drawn from a Dirichlet prior with parameter 1; and a distribution drawn from a Dirichlet prior with parameter 1/2.
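The nine distributions quoted above can be constructed as probability vectors over a support of size S = 10,000. The sketch below uses numpy and scipy; parameters not stated in the quote (e.g., the binomial number of trials n = S, and the random seed) are our assumptions, not the paper's.

```python
import numpy as np
from scipy import stats

S = 10_000                      # support size from the paper's settings
rng = np.random.default_rng(0)  # seed is our assumption
k = np.arange(1, S + 1)

def renormalize(p):
    """Re-normalize a (possibly truncated) pmf so it sums to 1."""
    return p / p.sum()

distributions = {
    "uniform": np.full(S, 1.0 / S),
    # two-step: half the symbols carry probability 0.5/S, the other half 1.5/S
    "two-step": np.concatenate([np.full(S // 2, 0.5 / S),
                                np.full(S // 2, 1.5 / S)]),
    # Zipf laws p(i) proportional to i^(-power), truncated at S
    "zipf-1/2": renormalize(k ** -0.5),
    "zipf-1": renormalize(k ** -1.0),
    # number of trials n = S is our assumption; the quote only gives p = 0.3
    "binomial-0.3": renormalize(stats.binom.pmf(np.arange(S), S, 0.3)),
    # geometric with success probability 0.9, truncated at S
    "geometric-0.9": renormalize(0.9 * 0.1 ** (k - 1)),
    # Poisson with mean 0.3*S, truncated at S
    "poisson-0.3S": renormalize(stats.poisson.pmf(np.arange(S), 0.3 * S)),
    # random distributions drawn from symmetric Dirichlet priors
    "dirichlet-1": rng.dirichlet(np.full(S, 1.0)),
    "dirichlet-1/2": rng.dirichlet(np.full(S, 0.5)),
}
```

Each entry is a length-S probability vector, so samples can be drawn from any of them with `rng.choice(S, size=n, p=distributions[name])`.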
Researcher Affiliation: Academia. Department of Electrical and Computer Engineering, University of California, San Diego, USA.
Pseudocode: No. The paper describes the steps for estimator construction and computation (e.g., "first split the samples into two halves", "classify the symbols", "Compute G'(x)"), but it does not present these steps in a formal pseudocode or algorithm block.
Open Source Code: No. The paper does not provide any explicit statement about open-source code availability or a link to a code repository.
Open Datasets: No. The paper describes using "nine natural-synthetic distributions" (e.g., uniform, Zipf, binomial, Poisson, Dirichlet prior) and explains how they were set up or drawn. These are not existing publicly available datasets with concrete access information (links, DOIs, or citations to specific public repositories), as the question requires; rather, they are distributions from which the experimental data was generated.
Dataset Splits: No. The paper describes repeating experiments 100 times and varying the sample size n for evaluation. It does not specify training, validation, or test splits for any dataset, nor does it mention cross-validation.
Hardware Specification: No. The paper does not provide any specific details about the hardware used to run the experiments (e.g., CPU models, GPU models, memory specifications).
Software Dependencies: No. The paper does not mention any specific software dependencies with version numbers.
Experiment Setup: Yes. We chose the parameter ε = 1. The geometric, Poisson, and Zipf distributions were truncated at S and re-normalized. The horizontal axis shows the number of samples, n, ranging from S^0.2 to S. Each experiment was repeated 100 times, and the reported results, shown on the vertical axis, reflect their mean values and standard deviations.
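The evaluation protocol in the quoted setup (sample sizes from S^0.2 to S, 100 repetitions, mean and standard deviation reported) can be sketched as below. The number of grid points, the log-spacing of the grid, the seed, and the plug-in entropy estimator used as the stand-in are all our assumptions for illustration, not the paper's choices.

```python
import numpy as np

S = 10_000
# horizontal axis: sample sizes n from S^0.2 up to S
# (15 log-spaced grid points is our assumption)
sample_sizes = np.unique((S ** np.linspace(0.2, 1.0, 15)).astype(int))

def empirical_entropy(sample):
    """Plug-in (empirical) entropy estimate in nats, a stand-in estimator."""
    counts = np.bincount(sample)
    q = counts[counts > 0] / len(sample)
    return -(q * np.log(q)).sum()

def evaluate(estimator, p, n, repeats=100, seed=0):
    """Mean and standard deviation of `estimator` over `repeats` independent
    runs, each on n i.i.d. samples drawn from the distribution p."""
    rng = np.random.default_rng(seed)
    vals = np.array([estimator(rng.choice(len(p), size=n, p=p))
                     for _ in range(repeats)])
    return vals.mean(), vals.std()

# example: empirical entropy of the uniform distribution at the largest n
p_uniform = np.full(S, 1.0 / S)
mean, std = evaluate(empirical_entropy, p_uniform, sample_sizes[-1])
```

Plotting `mean` with `std` error bars against `sample_sizes` for each estimator and distribution reproduces the shape of the paper's figures.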