Almost Linear Time Density Level Set Estimation via DBSCAN

Authors: Hossein Esfandiari, Vahab Mirrokni, Peilin Zhong (pp. 7349-7357)

AAAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In our empirical study, we show that our algorithm provides significant speedup over the previous algorithms, while achieving comparable solution quality." "In addition to our theoretical guarantees, we provide an empirical study of the algorithms on real and synthetic data sets."
Researcher Affiliation | Collaboration | Hossein Esfandiari (Google Research), Vahab Mirrokni (Google Research), Peilin Zhong (Columbia University); {esfandiari,mirrokni}@google.com, peilin.zhong@columbia.edu
Pseudocode | Yes | Algorithm 1 (DBSCAN), Algorithm 2 (Core Point Set Construction via Rounding), Algorithm 3 (Near-Linear-Time DBSCAN)
Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for its described methodology; it only references public code used for a baseline (KDTree).
Open Datasets | Yes | Dataset (E) MNIST (LeCun et al. 1998) contains 10 classes of handwritten digits. Dataset (G) mobile ... This dataset is available on Kaggle (https://www.kaggle.com/). Dataset (K) phonemes (Friedman, Hastie, and Tibshirani 2001). Dataset (L) YouTube8M contains video features (Abu-El-Haija et al. 2016); see http://research.google.com/youtube8m/download.html. Other datasets are available in the UCI repository (Dua and Graff 2017).
Dataset Splits | No | The paper does not provide explicit training/validation/test dataset splits. It evaluates clustering quality on real-world and synthetic datasets without defining separate sets for training, validation, or testing in the context of model development.
Hardware Specification | Yes | All experiments are performed on a machine with 32 GB main memory and an Intel(R) Xeon(R) CPU @ 2.30GHz.
Software Dependencies | Yes | Adjusted Rand Index and Adjusted Mutual Information scores are computed with the Python 3 package scikit-learn, version 0.23.1.
Experiment Setup | Yes | In all experiments, k = 10 is fixed. For each clustering task, DBSCAN++ subsamples m data points and computes core points among these subsamples; as suggested by Jang and Jiang (2018), m = 0.1·n^(D/(D+4)) is used in all experiments. For the paper's algorithm, t, the number of hash functions, is set to 2 ln(n). For experiments on synthetic datasets, ε = 3 D is fixed for all algorithms.
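As a concrete illustration of the evaluation pipeline the table describes, the following sketch runs scikit-learn's DBSCAN on toy data and scores the result with the same ARI/AMI metrics the paper reports. The dataset, `eps`, and `min_samples` values here are illustrative placeholders, not the paper's settings.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

# Toy data: three well-separated Gaussian blobs (illustrative; not the paper's datasets).
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

# eps and min_samples are illustrative values, not the paper's tuned settings.
y_pred = DBSCAN(eps=0.5, min_samples=10).fit_predict(X)

# Clustering-quality scores as in the paper's evaluation;
# DBSCAN marks noise points with the label -1.
ari = adjusted_rand_score(y_true, y_pred)
ami = adjusted_mutual_info_score(y_true, y_pred)
```

Both scores are 1.0 for a perfect recovery of the ground-truth partition and near 0 for a random labeling, which makes them comparable across datasets with different numbers of clusters.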
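The parameter formulas in the experiment-setup row can be written out directly. The rounding conventions below (truncating m, taking the ceiling of t) are assumptions for illustration; the paper states only the closed-form expressions.

```python
import math

def dbscanpp_subsample_size(n: int, D: int) -> int:
    # DBSCAN++ subsample size m = 0.1 * n^(D/(D+4)), as suggested by
    # Jang and Jiang (2018); truncation to an integer is an assumption.
    return max(1, int(0.1 * n ** (D / (D + 4))))

def num_hash_functions(n: int) -> int:
    # Number of hash functions t = 2 ln(n) for the paper's algorithm;
    # rounding up is an assumption, since t must be an integer.
    return math.ceil(2 * math.log(n))
```

For example, with n = 10000 points in D = 4 dimensions, the subsample size is 0.1 · 10000^(1/2) = 10 points, and t = ⌈2 ln 10000⌉ = 19 hash functions.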