Near-Optimal Density Estimation in Near-Linear Time Using Variable-Width Histograms

Authors: Siu On Chan, Ilias Diakonikolas, Rocco A. Servedio, Xiaorui Sun

NeurIPS 2014

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | The main contribution of this paper is a highly efficient density estimation algorithm that learns a variable-width histogram, i.e., a hypothesis distribution with a piecewise constant probability density function. In more detail, for any k and ε, the authors give an algorithm that makes O(k/ε²) draws from p, runs in O(k/ε²) time, and outputs a hypothesis distribution h that is piecewise constant with O(k·log²(1/ε)) pieces. With high probability the hypothesis h satisfies d_TV(p, h) ≤ C·opt_k(p) + ε, where d_TV denotes the total variation distance (statistical distance), C is a universal constant, and opt_k(p) is the smallest total variation distance between p and any k-piecewise constant distribution. The sample size and running time of the algorithm are optimal up to logarithmic factors. The approximation factor C is inherent in the problem: the authors prove that no algorithm with sample size bounded in terms of k and ε can achieve C < 2, regardless of the kind of hypothesis distribution it uses.
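The guarantee above can be illustrated with a toy sketch. The code below is not the paper's merge-based algorithm; it simply builds an equal-mass variable-width histogram (the hypothesis class of piecewise constant densities) from samples and numerically estimates its total variation distance to the target density. All function names, parameters, and the target mixture are illustrative assumptions.

```python
import numpy as np

def variable_width_histogram(samples, pieces):
    """Toy variable-width (equal-mass) histogram estimator on [0, 1).

    Bin edges are placed at empirical quantiles, so each bin holds roughly
    the same number of samples and bin widths adapt to the local density.
    Illustrative only; not the paper's algorithm.
    """
    qs = np.linspace(0.0, 1.0, pieces + 1)[1:-1]          # interior quantile levels
    edges = np.concatenate(([0.0], np.quantile(samples, qs), [1.0]))
    edges = np.unique(edges)                               # drop duplicate edges
    counts, edges = np.histogram(samples, bins=edges)
    density = counts / (len(samples) * np.diff(edges))     # piecewise-constant pdf
    return edges, density

def tv_distance_piecewise(edges, density, true_pdf, grid=100_000):
    """Numerically estimate d_TV(p, h) = (1/2) * integral |p(x) - h(x)| dx
    on [0, 1), where h is the piecewise-constant histogram."""
    xs = np.linspace(0.0, 1.0, grid, endpoint=False) + 0.5 / grid
    idx = np.clip(np.searchsorted(edges, xs, side="right") - 1, 0, len(density) - 1)
    return 0.5 * np.mean(np.abs(true_pdf(xs) - density[idx]))

rng = np.random.default_rng(0)
# Target p: itself 2-piecewise constant, so opt_k(p) = 0 for k >= 2.
n = 20_000
samples = np.where(rng.random(n) < 0.7,
                   rng.uniform(0.0, 0.3, n),
                   rng.uniform(0.3, 1.0, n))
true_pdf = lambda x: np.where(x < 0.3, 0.7 / 0.3, 0.3 / 0.7)

edges, density = variable_width_histogram(samples, pieces=10)
print(tv_distance_piecewise(edges, density, true_pdf))  # typically well below 0.1
```

Because the target here is exactly 2-piecewise constant, opt_k(p) = 0 and the estimated TV distance is dominated by the sampling error term, which shrinks at the O(1/√n) rate the paper's O(k/ε²) sample bound reflects.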
Researcher Affiliation | Collaboration | Siu-On Chan (Microsoft Research, sochan@gmail.com); Ilias Diakonikolas (University of Edinburgh, ilias.d@ed.ac.uk); Rocco A. Servedio (Columbia University, rocco@cs.columbia.edu); Xiaorui Sun (Columbia University, xiaoruisun@cs.columbia.edu)
Pseudocode | Yes | Algorithm Learn-WB-small-opt-k-histogram. Input: parameters k ≥ 1, ε > 0; access to i.i.d. draws from target distribution p over [0, 1). Output: if (i) p is ε/(384k·log(1/ε))-well-behaved and (ii) opt_k(p) ≤ ε, then with probability at least 99/100 the output is a distribution q such that d_TV(p, q) ≤ 2·opt_k(p) + 3ε.
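The output contract above bounds d_TV between p and the returned hypothesis q. When both distributions are piecewise constant on [0, 1), this distance can be computed exactly by refining the two sets of breakpoints and summing |p − q| times the width of each refined piece. The helper below is a self-contained sketch for verifying such a guarantee; the function name and interface are assumptions, not from the paper.

```python
import numpy as np

def tv_piecewise_constant(edges_p, vals_p, edges_q, vals_q):
    """Exact d_TV between two piecewise-constant densities on [0, 1):
    d_TV(p, q) = (1/2) * integral |p(x) - q(x)| dx.

    edges_* are sorted breakpoints starting at 0.0 and ending at 1.0;
    vals_* are the constant density values between consecutive edges.
    """
    edges = np.union1d(edges_p, edges_q)        # common refinement of breakpoints
    mids = (edges[:-1] + edges[1:]) / 2.0       # one sample point per refined piece
    p = vals_p[np.searchsorted(edges_p, mids, side="right") - 1]
    q = vals_q[np.searchsorted(edges_q, mids, side="right") - 1]
    return 0.5 * np.sum(np.abs(p - q) * np.diff(edges))

# Example: a 2-piecewise-constant density p vs. the uniform density q.
p_edges, p_vals = np.array([0.0, 0.3, 1.0]), np.array([2.0, 4.0 / 7.0])
q_edges, q_vals = np.array([0.0, 1.0]), np.array([1.0])
print(tv_piecewise_constant(p_edges, p_vals, q_edges, q_vals))  # ≈ 0.3
```

With a helper like this, the conditional guarantee is directly checkable in the opt_k(p) = 0 case: for a target that is exactly k-piecewise constant and well-behaved, any valid output q must satisfy d_TV(p, q) ≤ 3ε.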
Open Source Code | No | The paper does not provide any statement or link indicating that source code for the described methodology is publicly available.
Open Datasets | No | The paper is theoretical and focuses on sample complexity and algorithm design, not on empirical evaluation using specific datasets. No dataset is mentioned for training purposes.
Dataset Splits | No | The paper is theoretical and does not describe empirical experiments with data splits; no train/validation splits are provided.
Hardware Specification | No | The paper is theoretical and focuses on algorithm design and analysis. It does not describe any hardware used for running experiments.
Software Dependencies | No | The paper is theoretical and describes an algorithm; it does not mention specific software dependencies or versions used for implementation or experimentation.
Experiment Setup | No | The paper focuses on theoretical algorithm design and analysis, not empirical experiments. Therefore, it does not include details about an experimental setup such as hyperparameters or system-level training settings.