Statistical-Computational Trade-offs for Density Estimation

Authors: Anders Aamand, Alexandr Andoni, Justin Chen, Piotr Indyk, Shyam Narayanan, Sandeep Silwal, Haike Xu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We study the density estimation problem defined as follows: given k distributions p_1, ..., p_k over a discrete domain [n], as well as a collection of samples chosen from a query distribution q over [n], output p_i that is close to q. Recently [1] gave the first and only known result that achieves sublinear bounds in both the sampling complexity and the query time while preserving polynomial data structure space. However, their improvement over linear samples and time is only by subpolynomial factors. Our main result is a lower bound showing that, for a broad class of data structures, their bounds cannot be significantly improved. In particular, if an algorithm uses O(n/log^c k) samples for some constant c > 0 and polynomial space, then the query time of the data structure must be at least k^(1 - O(1)/log log k), i.e., close to linear in the number of distributions k. This is a novel statistical-computational trade-off for density estimation, demonstrating that any data structure must use close to a linear number of samples or take close to linear query time. The lower bound holds even in the realizable case where q = p_i for some i, and when the distributions are flat (specifically, all distributions are uniform over half of the domain [n]). We also give a simple data structure for our lower bound instance with asymptotically matching upper bounds. Experiments show that the data structure is quite efficient in practice.
Researcher Affiliation | Collaboration | Anders Aamand, University of Copenhagen, aamand@mit.edu; Alexandr Andoni, Columbia University, andoni@cs.columbia.edu; Justin Y. Chen, MIT, justc@mit.edu; Piotr Indyk, MIT, indyk@mit.edu; Shyam Narayanan, Citadel Securities, shyam.s.narayanan@gmail.com; Sandeep Silwal, UW-Madison, silwal@cs.wisc.edu; Haike Xu, MIT, haikexu@mit.edu
Pseudocode | Yes | Algorithm 1: Density estimation for half-uniform distributions (preprocessing) ... Algorithm 2: Query algorithm for half-uniform distributions
Open Source Code | Yes | Code is attached in the supplementary material.
Open Datasets | Yes | We test our algorithm in Section 4 experimentally on datasets of half-uniform distributions as in [1] and corresponding to our study in Sections 3 and 4.
Dataset Splits | No | The paper discusses input distributions and queries but does not explicitly describe train/validation/test dataset splits with percentages or counts.
Hardware Specification | No | The paper mentions running experiments but does not specify the hardware used (e.g., CPU/GPU models, memory specifications).
Software Dependencies | No | The paper mentions using code from [1] as a basis but does not specify software names with version numbers for reproducibility.
Experiment Setup | Yes | Given these parameters, we evaluate the Subset algorithm and the Elimination algorithm on 100 random queries where each query corresponds to picking one of the input distributions as the true distribution to draw samples from. ... We start with a modest value of L = 200 and increase L by a factor of 1.5 repeatedly until the Subset algorithm also achieves 100% accuracy on the queries (in reality, its failure probability will likely still be greater than that of the Elimination algorithm). The results we report correspond to this smallest value of L for which the algorithm got all the queries correct. (Hedged sketches of the half-uniform setup and of this evaluation loop follow below.)
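
The abstract quoted under Research Type pins down the lower-bound instance concretely: every distribution is uniform over half of the domain [n], and queries are drawn from one of them (the realizable case q = p_i). The following is a minimal sketch, assuming a NumPy environment, of that half-uniform setup together with the obvious linear-scan baseline whose query time is linear in k; the function names and parameter values are hypothetical and are not the authors' supplementary code.

```python
# Minimal sketch (assumed names/parameters): k half-uniform distributions over
# a domain of size n and a naive linear-scan estimator for a query drawn from
# one of them. Not the authors' released code.
import numpy as np

rng = np.random.default_rng(0)

def make_half_uniform_supports(n, k):
    # Each distribution p_j is uniform over a random subset of n/2 domain elements.
    return [set(rng.choice(n, size=n // 2, replace=False).tolist()) for _ in range(k)]

def linear_scan_query(supports, samples):
    # Score every candidate by how many query samples its support covers.
    # For half-uniform candidates this picks the maximum-likelihood distribution,
    # but it costs Theta(k * |samples|) per query, i.e. time close to linear in k,
    # which is the regime the paper's lower bound says is hard to beat with few samples.
    scores = [sum(x in s for x in samples) for s in supports]
    return int(np.argmax(scores))

n, k, m = 1000, 64, 200                        # domain size, number of distributions, samples per query
supports = make_half_uniform_supports(n, k)
true_i = 7                                     # realizable case: q = p_i
samples = rng.choice(sorted(supports[true_i]), size=m).tolist()
print(linear_scan_query(supports, samples) == true_i)   # True with overwhelming probability
```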
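
The Experiment Setup entry describes the evaluation protocol: 100 random queries, each drawing samples from one input distribution chosen as the truth, with the parameter L grown by a factor of 1.5 (starting from L = 200) until the algorithm answers every query correctly. Below is a hedged sketch of that loop; `algorithm` and `sample_from` are placeholder callables standing in for the Subset/Elimination algorithms and the sampling interface, not the paper's implementation.

```python
# Sketch of the reported evaluation loop under stated assumptions: run 100
# random queries per setting of L and grow L by a factor of 1.5 until the
# algorithm answers all queries correctly; the smallest such L is reported.
import random

def smallest_perfect_L(algorithm, sample_from, k, num_queries=100, L=200, growth=1.5):
    while True:
        correct = 0
        for _ in range(num_queries):
            true_i = random.randrange(k)        # pick one input distribution as the truth
            samples = sample_from(true_i)       # draw query samples from it (assumed interface)
            if algorithm(samples, L) == true_i:
                correct += 1
        if correct == num_queries:              # 100% accuracy on these queries
            return L
        L = int(L * growth)                     # increase L by a factor of 1.5 and retry

# Toy usage with an oracle algorithm that always answers correctly (returns 200):
print(smallest_perfect_L(algorithm=lambda samples, L: samples,
                         sample_from=lambda i: i, k=10))
```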