Almost Linear Time Density Level Set Estimation via DBSCAN

Authors: Hossein Esfandiari, Vahab Mirrokni, Peilin Zhong (pp. 7349-7357)

AAAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In our empirical study, we show that our algorithm provides significant speedup over the previous algorithms, while achieving comparable solution quality." "In addition to our theoretical guarantees, we provide an empirical study of the algorithms on real and synthetic data sets."
Researcher Affiliation | Collaboration | Hossein Esfandiari (Google Research), Vahab Mirrokni (Google Research), Peilin Zhong (Columbia University); {esfandiari,mirrokni}@google.com, peilin.zhong@columbia.edu
Pseudocode | Yes | Algorithm 1 (DBSCAN), Algorithm 2 (Core Point Set Construction via Rounding), Algorithm 3 (Near-Linear-Time DBSCAN)
Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for its described methodology; it only references public code used for a baseline (KDTree).
Open Datasets | Yes | Dataset (E) MNIST (LeCun et al. 1998) contains 10 classes of handwritten digits. Dataset (G) mobile ... This dataset is available on Kaggle (https://www.kaggle.com/). Dataset (K) phonemes (Friedman, Hastie, and Tibshirani 2001). Dataset (L) YouTube8M contains video features (Abu-El-Haija et al. 2016); see http://research.google.com/youtube8m/download.html. Other datasets are available in the UCI repository (Dua and Graff 2017).
Dataset Splits | No | The paper does not provide explicit training/validation/test dataset splits. It evaluates clustering quality on real-world and synthetic datasets without defining separate sets for training, validation, or testing in the context of model development.
Hardware Specification | Yes | All experiments are performed on a machine with 32 GB main memory and an Intel(R) Xeon(R) CPU @ 2.30GHz.
Software Dependencies | Yes | Adjusted Rand Index and Adjusted Mutual Information scores are computed with the Python 3 package scikit-learn, version 0.23.1.
Experiment Setup | Yes | In all experiments, k = 10 is fixed. For each clustering task, DBSCAN++ subsamples m data points and computes core points among these subsamples; as suggested by Jang and Jiang (2018), m = 0.1·n^(D/(D+4)) is used in all experiments. For the paper's algorithm, t, the number of hash functions, is set to 2 ln(n). For experiments on synthetic datasets, ε = 3 D is fixed for all algorithms.
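As a concrete illustration of the evaluation pipeline the table describes, the following sketch runs scikit-learn's DBSCAN on toy data and scores the result with the same ARI/AMI metrics the paper reports. The dataset, `eps`, and `min_samples` values here are illustrative placeholders, not the paper's settings.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

# Toy data: three well-separated Gaussian blobs (illustrative; not the paper's datasets).
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

# eps and min_samples are illustrative values, not the paper's tuned settings.
y_pred = DBSCAN(eps=0.5, min_samples=10).fit_predict(X)

# Clustering-quality scores as in the paper's evaluation;
# DBSCAN marks noise points with the label -1.
ari = adjusted_rand_score(y_true, y_pred)
ami = adjusted_mutual_info_score(y_true, y_pred)
```

Both scores are 1.0 for a perfect recovery of the ground-truth partition and near 0 for a random labeling, which makes them comparable across datasets with different numbers of clusters.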
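The parameter formulas in the experiment-setup row can be written out directly. The rounding conventions below (truncating m, taking the ceiling of t) are assumptions for illustration; the paper states only the closed-form expressions.

```python
import math

def dbscanpp_subsample_size(n: int, D: int) -> int:
    # DBSCAN++ subsample size m = 0.1 * n^(D/(D+4)), as suggested by
    # Jang and Jiang (2018); truncation to an integer is an assumption.
    return max(1, int(0.1 * n ** (D / (D + 4))))

def num_hash_functions(n: int) -> int:
    # Number of hash functions t = 2 ln(n) for the paper's algorithm;
    # rounding up is an assumption, since t must be an integer.
    return math.ceil(2 * math.log(n))
```

For example, with n = 10000 points in D = 4 dimensions, the subsample size is 0.1 · 10000^(1/2) = 10 points, and t = ⌈2 ln 10000⌉ = 19 hash functions.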