Faster DBSCAN via subsampled similarity queries
Authors: Heinrich Jiang, Jennifer Jang, Jakub Łącki
NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide an extensive experimental analysis showing that on large datasets, one can subsample as little as 0.1% of the neighborhood graph, leading to as much as over 200x speedup and 250x reduction in RAM consumption compared to scikit-learn's implementation of DBSCAN, while still maintaining competitive clustering performance. |
| Researcher Affiliation | Industry | Heinrich Jiang, Google Research, heinrichj@google.com; Jennifer Jang, Waymo, jangj@waymo.com; Jakub Łącki, Google Research, jlacki@google.com |
| Pseudocode | Yes | Algorithm 1 Subsampled Neighborhood Graph DBSCAN (SNG-DBSCAN). Inputs: X_[n], sampling rate s, ϵ, MinPts. Initialize graph G ← (X_[n], ∅). For each x ∈ X_[n], sample ⌈sn⌉ examples x_{i_1}, ..., x_{i_⌈sn⌉} from X_[n] and add edge (x, x_{i_j}) to G if |x − x_{i_j}| ≤ ϵ for j ∈ [⌈sn⌉]. Let K := {K_1, ..., K_ℓ} be the connected components of the subgraph of G induced by vertices of degree at least MinPts. Initialize C_i ← ∅ for i ∈ [ℓ] and define C := {C_1, ..., C_ℓ}. For each x ∈ X_[n], add x to C_i if x ∈ K_i; otherwise, if x is connected to some point in K_i, add x to C_i (if multiple such i ∈ [ℓ] exist, choose one arbitrarily). Return C. (A plain-Python reading of this pseudocode is sketched after the table.) |
| Open Source Code | No | The paper mentions 'our implementation of SNG-DBSCAN' but does not provide an explicit statement about releasing the source code or a link to a repository. |
| Open Datasets | Yes | We compare the performance of SNG-DBSCAN against DBSCAN on 5 large (~1,000,000+ datapoints) and 12 smaller (~100 to ~100,000 datapoints) datasets from UCI [11] and OpenML [47]. |
| Dataset Splits | No | The paper runs clustering algorithms on full datasets and evaluates performance using ground-truth labels. It does not mention explicit training, validation, or test splits (e.g., splits for model training or hyperparameter tuning). |
| Hardware Specification | No | The paper mentions experiments were run 'on a cloud compute environment' and on 'cloud machines with up to 750GB of RAM'. However, it does not provide specific hardware details such as GPU/CPU models or explicit cloud instance types. |
| Software Dependencies | No | The paper mentions comparing 'our implementation of SNG-DBSCAN to that of scikit-learn's DBSCAN [39], both of which are implemented with Cython'. However, no specific version numbers for scikit-learn, Cython, or other software components are provided. |
| Experiment Setup | Yes | The settings of MinPts and the range of ϵ that we ran SNG-DBSCAN and DBSCAN on, as well as how s was chosen, are shown in the Appendix. For simplicity we fixed MinPts and only tuned ϵ, which is the more essential hyperparameter. ... DBSCAN was run with MinPts = 10, and SNG-DBSCAN was run with MinPts = max(2, ⌊10s⌋) for all datasets. |
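
To make the pseudocode row concrete, here is a minimal Python/NumPy sketch of how Algorithm 1 could be implemented. This is an illustrative reading of the pseudocode, not the authors' Cython implementation: the function name `sng_dbscan`, the use of SciPy's sparse connected-components routine, and the default `min_pts = max(2, floor(10 * s))` (taken from the experiment-setup row above) are assumptions made here for clarity.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components


def sng_dbscan(X, s, eps, min_pts=None, seed=None):
    """Illustrative sketch of Algorithm 1 (SNG-DBSCAN); returns labels, -1 = noise."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = len(X)
    m = int(np.ceil(s * n))                # ceil(sn) sampled candidates per point
    if min_pts is None:
        min_pts = max(2, int(10 * s))      # MinPts = max(2, floor(10s)), per the setup row

    # Build the subsampled epsilon-neighborhood graph edge by edge.
    rows, cols = [], []
    for i in range(n):
        cand = rng.choice(n, size=m, replace=False)                 # sampled candidates
        hits = cand[np.linalg.norm(X[cand] - X[i], axis=1) <= eps]  # keep those within eps
        rows.extend([i] * len(hits))
        cols.extend(hits.tolist())
    G = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    G = G.maximum(G.T).tocsr()             # treat sampled edges as undirected

    # Core points: degree at least MinPts in the subsampled graph.
    degree = np.asarray((G > 0).sum(axis=1)).ravel()
    core = np.flatnonzero(degree >= min_pts)

    labels = np.full(n, -1)
    if core.size:
        # Clusters = connected components of the subgraph induced by core points.
        _, comp = connected_components(G[core][:, core], directed=False)
        labels[core] = comp

        # Border points: non-core points adjacent to a core point join one of its
        # clusters; ties are broken arbitrarily, as in the pseudocode.
        for i in np.setdiff1d(np.arange(n), core):
            nbr_labels = labels[G[i].indices]
            assigned = nbr_labels[nbr_labels >= 0]
            if assigned.size:
                labels[i] = assigned[0]
    return labels
```

For example, `sng_dbscan(X, s=0.001, eps=0.5)` would probe roughly 0.1% of the candidate neighbors per point, which is the subsampling regime in which the paper reports its largest speed and memory gains over building the full ϵ-neighborhood graph.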