Data Structures for Density Estimation
Authors: Anders Aamand, Alexandr Andoni, Justin Y. Chen, Piotr Indyk, Shyam Narayanan, Sandeep Silwal
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On the empirical side, we experimentally evaluate the linear-time algorithm and compare its performance to the (Acharya et al., 2018) algorithm on synthetic and on real networking data. These experiments display the practical benefits of the faster algorithm. For example, for synthetic data, we demonstrate that our faster algorithm achieves over 2x reduction in the number of comparisons needed to achieve the same level of accuracy. Similarly, our experiments on network data show up to 5x reduction in the number of comparisons. |
| Researcher Affiliation | Academia | 1Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. 2Data Science Institute, Columbia University, New York, NY 10027, USA. |
| Pseudocode | Yes | Algorithm 1 Heavy light decomposition; Algorithm 2 Preprocessing; Algorithm 3 Sublinear Time Hypothesis Selection |
| Open Source Code | Yes | Code available at https://github.com/justc2/datastructdensityest |
| Open Datasets | Yes | The underlying network data we use comes from the CAIDA Anonymized Internet Trace internet traffic dataset, which is IP traffic data collected at a backbone link of a Tier1 ISP in a data center in NYC. ... From CAIDA internet traces 2019, https://www.caida.org/catalog/datasets/monitors/passive-equinix-nyc/ |
| Dataset Splits | No | The paper describes its experimental setup including sample sizes and query formation (e.g., 'For a series of 100 query chunks, we use the prior 2,048 chunks as the set of distributions to search over'), but does not explicitly mention or specify training, validation, or testing splits in the conventional machine learning sense for model development. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as CPU or GPU models, or memory specifications. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., programming language versions, library versions, or specific solver versions) used in the experiments. |
| Experiment Setup | Yes | The algorithms have several key parameters which we vary throughout the experiments. `nAllPairs` is the number of distributions in each level of the tournament which are randomly sampled to compete in an all-pairs tournament at the end of the algorithm. This parameter is used in both the base tournament and our fast tournament. `fastConst` controls the number of samples used in each level of the fast tournament. In particular, at the ith level, the fast tournament uses `fastConst^i` samples for each Scheffe test. |
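The tournament-based algorithms above repeatedly run Scheffe tests, which pit two candidate distributions against each other using samples from the unknown distribution. As context for the experiment-setup row, here is a minimal sketch of a single Scheffe test over a finite domain; the function name and array-based interface are illustrative assumptions, not the paper's released code.

```python
import numpy as np

def scheffe_test(p, q, samples):
    """Decide which of two known candidate distributions p, q
    (probability arrays over a finite domain {0, ..., d-1}) better
    matches the unknown distribution that generated `samples`
    (an integer array of observed domain elements).
    Returns 0 if p wins the test, 1 if q wins.

    Note: this is an illustrative sketch, not the authors' implementation.
    """
    # Scheffe set A = {x : p(x) > q(x)}, where p claims more mass than q
    A = p > q
    # Empirical mass the samples place on A
    emp = np.mean(A[samples])
    # The candidate whose mass on A is closer to the empirical mass wins
    return 0 if abs(p[A].sum() - emp) <= abs(q[A].sum() - emp) else 1
```

In the fast tournament described in the table, the number of samples fed to each such test grows with the tournament level (governed by `fastConst`), so early rounds are cheap and only surviving candidates face higher-sample comparisons.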