Faster DBSCAN via subsampled similarity queries

Authors: Heinrich Jiang, Jennifer Jang, Jakub Łącki

Venue: NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We provide an extensive experimental analysis showing that on large datasets, one can subsample as little as 0.1% of the neighborhood graph, leading to as much as over 200x speedup and 250x reduction in RAM consumption compared to scikit-learn's implementation of DBSCAN, while still maintaining competitive clustering performance."
Researcher Affiliation | Industry | Heinrich Jiang (Google Research, heinrichj@google.com); Jennifer Jang (Waymo, jangj@waymo.com); Jakub Łącki (Google Research, jlacki@google.com)
Pseudocode | Yes | Algorithm 1: Subsampled Neighborhood Graph DBSCAN (SNG-DBSCAN)
  Inputs: X_[n], sampling rate s, ε, MinPts.
  1. Initialize graph G ← (X_[n], ∅).
  2. For each x ∈ X_[n], sample ⌈sn⌉ examples x_{i_1}, ..., x_{i_⌈sn⌉} from X_[n] and add edge (x, x_{i_j}) to G if |x − x_{i_j}| ≤ ε for j ∈ [⌈sn⌉].
  3. Let K := {K_1, ..., K_ℓ} be the connected components of the subgraph of G induced by vertices of degree at least MinPts.
  4. Initialize C_i ← ∅ for i ∈ [ℓ] and define C := {C_1, ..., C_ℓ}.
  5. For each x ∈ X_[n], add x to C_i if x ∈ K_i. Otherwise, if x is connected to some point in K_i, add x to C_i (if multiple such i ∈ [ℓ] exist, choose one arbitrarily).
  6. Return C.
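To make the pseudocode concrete, here is a minimal NumPy/SciPy sketch of the algorithm as written above. This is an illustrative reconstruction, not the authors' Cython implementation: the function name sng_dbscan is ours, candidates are sampled with replacement for brevity, and plain Euclidean distance is assumed.

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import connected_components

    def sng_dbscan(X, s, eps, min_pts, seed=0):
        """Sketch of SNG-DBSCAN: DBSCAN over a subsampled eps-neighborhood
        graph. X: (n, d) array; s: sampling rate; eps, min_pts: the usual
        DBSCAN parameters. Returns one label per point (-1 = noise)."""
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        m = int(np.ceil(s * n))  # number of sampled candidates per point

        # Step 2: for each point, test only m random candidates (sampled
        # with replacement here, for brevity) instead of all n points.
        rows, cols = [], []
        for i in range(n):
            cand = rng.integers(0, n, size=m)
            cand = cand[cand != i]  # drop self-edges
            near = cand[np.linalg.norm(X[cand] - X[i], axis=1) <= eps]
            rows.extend([i] * len(near))
            cols.extend(near.tolist())
        data = np.ones(len(rows), dtype=np.int8)
        G = csr_matrix((data, (rows, cols)), shape=(n, n))
        G = ((G + G.T) > 0).astype(np.int8)  # symmetrize and deduplicate

        # Step 3: core points have degree >= min_pts in the sampled graph.
        core = np.asarray(G.sum(axis=1)).ravel() >= min_pts
        core_idx = np.flatnonzero(core)

        # Connected components of the subgraph induced by core points.
        sub = G[core_idx][:, core_idx]
        _, comp = connected_components(sub, directed=False)

        # Step 5: label core points by component, then attach each border
        # point to an arbitrary neighboring core cluster; the rest stay -1.
        labels = np.full(n, -1)
        labels[core_idx] = comp
        for i in np.flatnonzero(~core):
            nbrs = G.indices[G.indptr[i]:G.indptr[i + 1]]
            core_nbrs = nbrs[core[nbrs]]
            if core_nbrs.size:
                labels[i] = labels[core_nbrs[0]]
        return labels

The key cost saving is in step 2: each point issues only ⌈sn⌉ distance queries rather than n, so the graph is built in O(sn²) distance computations and holds O(sn²) edges in expectation, which is where the reported speedup and RAM reduction come from.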
Open Source Code | No | The paper mentions "our implementation of SNG-DBSCAN" but does not explicitly state that the source code is released, nor does it link to a repository.
Open Datasets | Yes | "We compare the performance of SNG-DBSCAN against DBSCAN on 5 large (~1,000,000+ datapoints) and 12 smaller (~100 to ~100,000 datapoints) datasets from UCI [11] and OpenML [47]."
Dataset Splits | No | The paper runs clustering algorithms on whole datasets and evaluates against the given labels; it does not mention explicit training, validation, or test splits (e.g., for model training or hyperparameter tuning).
Hardware Specification | No | The paper mentions that experiments were run "on a cloud compute environment" and on "cloud machines with up to 750GB of RAM", but it does not provide specific hardware details such as CPU/GPU models or explicit cloud instance types.
Software Dependencies | No | The paper mentions comparing "our implementation of SNG-DBSCAN to that of scikit-learn's DBSCAN [39], both of which are implemented with Cython", but no version numbers for scikit-learn, Cython, or other software components are provided.
Experiment Setup | Yes | "The settings of MinPts and range of ε that we ran SNG-DBSCAN and DBSCAN on, as well as how s was chosen, are shown in the Appendix. For simplicity we fixed MinPts and only tuned ε, which is the more essential hyperparameter. ... DBSCAN was run with MinPts = 10, and SNG-DBSCAN was run with MinPts = max(2, ⌊10s⌋) for all datasets."
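As a concrete reading of these settings, a reproduction might configure the two methods as below. This assumes scikit-learn's standard DBSCAN API for the baseline; the eps value is a placeholder, since the paper tunes ε per dataset over ranges listed in its Appendix.

    from sklearn.cluster import DBSCAN

    s, eps = 0.001, 0.5  # placeholder values; the paper tunes eps per dataset

    # Baseline DBSCAN: MinPts fixed at 10.
    baseline = DBSCAN(eps=eps, min_samples=10)

    # SNG-DBSCAN: MinPts scaled down with the sampling rate,
    # MinPts = max(2, floor(10 * s)), per the quoted setup.
    sng_min_pts = max(2, int(10 * s))

Scaling MinPts with s is consistent with the algorithm: a point's degree in the subsampled graph is roughly s times its degree in the full ε-neighborhood graph, so the core-point threshold must shrink accordingly.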