Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
DBSCAN++: Towards fast and scalable density clustering
Authors: Jennifer Jang, Heinrich Jiang
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show empirically that, compared to traditional DBSCAN, DBSCAN++ can provide not only competitive performance but also added robustness in the bandwidth hyperparameter while taking a fraction of the runtime. We show on both simulated datasets and real datasets that DBSCAN++ runs in a fraction of the time compared to DBSCAN, while giving competitive performance and consistently producing more robust clustering scores across hyperparameter settings. |
| Researcher Affiliation | Industry | Jennifer Jang (Uber), Heinrich Jiang (Google Research). |
| Pseudocode | Yes | Algorithm 1 DBSCAN; Algorithm 2 DBSCAN++; Algorithm 3 Greedy K-center Initialization. |
| Open Source Code | No | The paper does not provide any explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | We used Phonemes (Friedman et al., 2001), a dataset of log periodograms of spoken phonemes, and MNIST, a sub-sample of the MNIST handwriting recognition dataset after running a PCA down to 20 dimensions. The rest of the datasets we used are standard UCI or Kaggle datasets used for clustering. |
| Dataset Splits | No | The paper lists datasets and mentions tuning parameters on 'p' values via validation, but it does not provide specific train/validation/test dataset splits (e.g., percentages, absolute counts, or explicit splitting methodology) needed to reproduce the data partitioning. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as CPU or GPU models, or memory specifications. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python, TensorFlow, PyTorch versions, or specific libraries with their versions) that would be needed to replicate the experiment. |
| Experiment Setup | Yes | We fixed minPts = 10 for all procedures throughout experiments. DBSCAN was initiated with hyperparameters ε = 8 and minPts = 10, and DBSCAN++ with ε = 60, m/n = 0.3, and minPts = 10. |
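
Since no source code is linked, the following is a minimal sketch of how the three hyperparameters in the Experiment Setup row (ε, m/n, and minPts) could interact in a DBSCAN++-style procedure. It is not the authors' implementation. It assumes a uniform-sampling variant in which only a sampled fraction of the points is tested for being a core point, each core point is linked to every point within ε of it, and clusters are the resulting connected components. The function name `dbscan_pp`, the parameter `sample_fraction` (standing in for m/n), the brute-force neighbor search, and the convention that a point counts as its own neighbor are all illustrative assumptions.

```python
import math
import random


def dbscan_pp(points, eps, min_pts, sample_fraction, seed=0):
    """Hypothetical DBSCAN++-style sketch; returns one label per point (-1 = noise)."""
    n = len(points)
    # m/n = sample_fraction: only m sampled points are tested for being core points.
    m = min(n, max(1, int(round(sample_fraction * n))))
    sample = random.Random(seed).sample(range(n), m)

    def neighbors(i):
        # Brute-force eps-neighborhood over the full dataset (includes i itself).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    # Union-find structure used to track connected components.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    clustered = set()
    for c in sample:
        nbrs = neighbors(c)
        if len(nbrs) >= min_pts:  # c is a core point with respect to all points
            for j in nbrs:
                parent[find(j)] = find(c)  # link c to every point within eps
                clustered.add(j)

    # Convert component roots into consecutive cluster ids; the rest stay noise.
    labels = [-1] * n
    ids = {}
    for i in range(n):
        if i in clustered:
            labels[i] = ids.setdefault(find(i), len(ids))
    return labels
```

Under these assumptions, the DBSCAN++ settings quoted in the table would correspond to a call such as `dbscan_pp(X, eps=60, min_pts=10, sample_fraction=0.3)`, where `X` is a list of coordinate tuples. The brute-force search keeps the sketch dependency-free; reproducing the reported runtimes would require the neighbor-search method and hardware details that the paper does not specify.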