Scalable DBSCAN with Random Projections

Authors: HaoChuan Xu, Ninh Pham

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Empirically, s DBSCAN is significantly faster and provides higher accuracy than competitive DBSCAN variants on real-world million-point data sets.
Researcher Affiliation Academia Hao Chuan Xu School of Computer Science University of Auckland hxu612@aucklanduni.ac.nz Ninh Pham School of Computer Science University of Auckland ninh.pham@auckland.ac.nz
Pseudocode Yes Algorithm 1 DBSCAN
Open Source Code Yes Our code is available at https://github.com/Ninh Pham/s Dbscan.
Open Datasets Yes We conduct experiments on three popular data sets: Mnist (n = 70, 000, d = 784, # clusters = 10), Pamap2 (n = 1, 770, 131, d = 51, # clusters = 18), and Mnist8m (n = 8, 100, 000, d = 784, # clusters = 10).
Dataset Splits No We conduct experiments on three popular data sets: Mnist (n = 70, 000, d = 784, # clusters = 10), Pamap2 (n = 1, 770, 131, d = 51, # clusters = 18), and Mnist8m (n = 8, 100, 000, d = 784, # clusters = 10).
Hardware Specification Yes We conducted experiments on Ubuntu 20.04.4 with an AMD Ryzen Threadripper 3970X 2.2GHz 32-core processor (64 threads) with 128GB of DRAM.
Software Dependencies Yes We implement s DBSCAN and s OPTICS in C++ and compile with g++ -O3 -std=c++17 -fopenmp -march=native. We use the Eigen library 2 for SIMD vectorization on computing the distances.
Experiment Setup Yes Parameter settings. We consider min Pts = 50 for all experiments. s DBSCAN and s OPTICS use D = 1024, m = min Pts. Randomized kernel embeddings use σ = 2ε, d = 1024. We use k = 5 for Mnist and k = 10 for Pamap2 and Mnist8m.