Almost Linear Time Density Level Set Estimation via DBSCAN
Authors: Hossein Esfandiari, Vahab Mirrokni, Peilin Zhong (pp. 7349-7357)
AAAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In addition to our theoretical guarantees, we provide an empirical study of the algorithms on real and synthetic data sets. In our empirical study, we show that our algorithm provides significant speedup over the previous algorithms, while achieving comparable solution quality. |
| Researcher Affiliation | Collaboration | Hossein Esfandiari1, Vahab Mirrokni1, Peilin Zhong2 1Google Research 2Columbia University {esfandiari,mirrokni}@google.com, peilin.zhong@columbia.edu |
| Pseudocode | Yes | Algorithm 1 DBSCAN, Algorithm 2 Core Point Set Construction via Rounding, Algorithm 3 Near Linear time DBSCAN |
| Open Source Code | No | The paper does not provide an explicit statement or link for the open-source code of their described methodology. It only references public code used for a baseline (KDTree). |
| Open Datasets | Yes | Dataset (E) MNIST (LeCun et al. 1998) contains 10 classes of handwritten digits. Dataset (G) mobile ... This dataset is available on Kaggle. See https://www.kaggle.com/. Dataset (K) phonemes (Friedman, Hastie, and Tibshirani 2001). Dataset (L) YouTube-8M contains video features (Abu-El-Haija et al. 2016). See http://research.google.com/youtube8m/download.html. The other datasets are available in the UCI repository (Dua and Graff 2017). |
| Dataset Splits | No | The paper does not provide explicit training/validation/test dataset splits. It describes evaluating clustering quality on real-world and synthetic datasets, without defining separate sets for training, validation, or testing in the context of model development. |
| Hardware Specification | Yes | All experiments are performed on a machine with 32G main memory and Intel(R) Xeon(R) CPU @ 2.30GHz. |
| Software Dependencies | Yes | Adjusted Rand Index scores and Adjusted Mutual Information scores are evaluated via the python3 package scikit-learn with version 0.23.1. |
| Experiment Setup | Yes | In all experiments, we fixed k = 10. For each clustering task, DBSCAN++ needs to subsample m data points and compute core points among these subsamples. As suggested by (Jang and Jiang 2018), we set m = 0.1·n^(D/(D+4)) for all experiments. For our algorithm, we set t, the number of hash functions, to be 2 ln(n). For experiments on synthetic datasets, we fix ε = 3 D for all algorithms. |
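The report states that clustering quality was scored with Adjusted Rand Index and Adjusted Mutual Information from scikit-learn (v0.23.1), and that DBSCAN++ subsamples m = 0.1·n^(D/(D+4)) points. A minimal sketch of that evaluation setup, using hypothetical label arrays (the dataset names and label values below are illustrative, not from the paper):

```python
# Sketch of the quality metrics and subsample-size formula described above.
# Assumes scikit-learn is installed; the paper used version 0.23.1.
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

# Hypothetical ground-truth classes and a predicted clustering whose
# cluster IDs are permuted relative to the truth.
truth = [0, 0, 1, 1, 2, 2]
pred = [1, 1, 0, 0, 2, 2]

# Both scores are invariant to label permutation, so a perfect clustering
# under relabeling scores 1.0 on each metric.
ari = adjusted_rand_score(truth, pred)
ami = adjusted_mutual_info_score(truth, pred)
print(ari, ami)  # 1.0 1.0

# DBSCAN++ subsample size from the setup row: m = 0.1 * n^(D/(D+4)).
def dbscanpp_subsample_size(n: int, d: int) -> int:
    return int(0.1 * n ** (d / (d + 4)))

print(dbscanpp_subsample_size(100_000, 10))  # subsample size for n=100k, D=10
```

The permutation invariance is why ARI/AMI are preferred over raw label accuracy for comparing clusterings: cluster IDs carry no inherent meaning across algorithms.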