Efficient Centroid-Linkage Clustering
Authors: Mohammadhossein Bateni, Laxman Dhulipala, Willem Fletcher, Kishen N. Gowda, D Ellis Hershkowitz, Rajesh Jayaram, Jakub Łącki
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also evaluate our algorithm empirically. By leveraging a state-of-the-art nearest-neighbor search library, we obtain a fast and accurate Centroid-Linkage HAC algorithm. Compared to an existing state-of-the-art exact baseline, our implementation maintains the clustering quality while delivering up to a 36× speedup due to performing fewer distance comparisons. |
| Researcher Affiliation | Collaboration | Mohammad Hossein Bateni (Google Research, New York, USA); Laxman Dhulipala (University of Maryland, College Park, USA); Willem Fletcher (Brown University, Providence, USA); Kishen N. Gowda (University of Maryland, College Park, USA); D Ellis Hershkowitz (Brown University, Providence, USA); Rajesh Jayaram (Google Research, New York, USA); Jakub Łącki (Google Research, New York, USA) |
| Pseudocode | Yes | Algorithm 1 Dynamic ANNS for Adaptive Updates. Algorithm 2 Handle-Merge. Algorithm 3 Approximate-HAC. |
| Open Source Code | Yes | Our implementation can be found at https://github.com/kishen19/Centroid HAC. |
| Open Datasets | Yes | We evaluate the clustering quality of our approximate Centroid HAC algorithm against ground truth clusterings using standard metrics such as the Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), Dendrogram Purity, and the unsupervised Dasgupta cost [11]. Our primary objective is to assess the performance of our algorithm across various values of ϵ on a diverse set of benchmarks. We primarily compare our results with those obtained using exact Centroid HAC. For these experiments, we consider the standard benchmark clustering datasets iris, digits, cancer, wine, and faces from the UCI repository (obtained from the sklearn.datasets package). We also consider the MNIST dataset which contains images of grayscale digits between 0 and 9, and birds, a dataset containing images of 525 species of birds; see Appendix E for more details. The licenses of these datasets are as follows: iris (CC BY 4.0), wine (CC BY 4.0), cancer (CC BY 4.0), digits (CC BY 4.0), faces (CC BY 4.0), Covertype (CC BY 4.0), mnist (CC BY-SA 3.0 DEED), birds (CC0: Public Domain), and sift-1M (CC0: Public Domain). |
| Dataset Splits | No | The paper does not provide dataset split information (exact percentages, sample counts, citations to predefined splits, or a splitting methodology) that would be needed to reproduce a train/validation/test partition. For this clustering task, evaluation is performed on the full dataset or on specific dendrogram cuts rather than on traditional supervised learning splits. |
| Hardware Specification | Yes | We run our experiments on a 96-core Dell PowerEdge R940 (with two-way hyperthreading) machine with 4 × 2.4GHz Intel 24-core 8160 Xeon processors (with 33MB L3 cache) and 1.5TB of main memory. |
| Software Dependencies | Yes | Our programs are implemented in C++ and compiled with the g++ compiler (version 11.4) with -O3 flag. |
| Experiment Setup | Yes | For the approximate methods, we considered ϵ = 0.1. Table 1: The ARI and NMI scores of our approximate Centroid HAC implementations for ϵ = 0.1, 0.2, 0.4, and 0.8, versus Exact Centroid HAC. |
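
The Pseudocode row above names Algorithm 3 (Approximate-HAC), which builds on the dynamic ANNS routines of Algorithms 1 and 2. As a point of reference, the sketch below shows plain centroid-linkage HAC in Python with a brute-force closest-pair search standing in where the paper plugs in its dynamic approximate nearest-neighbor index; the function name, structure, and example are our own illustration under that assumption, not the paper's C++ implementation.

```python
# A minimal sketch of centroid-linkage HAC (illustration only, not the paper's
# C++ implementation). The closest-pair search below is brute force; the paper
# instead queries a dynamic approximate nearest-neighbor index over centroids,
# which may return any centroid within a (1+eps) factor of the true nearest one.
import numpy as np

def centroid_hac(points):
    """Return the merge list [(cluster_a, cluster_b, centroid_distance), ...]."""
    points = np.asarray(points, dtype=float)
    centroids = {i: points[i] for i in range(len(points))}  # cluster id -> centroid
    sizes = {i: 1 for i in range(len(points))}              # cluster id -> #points
    merges, next_id = [], len(points)

    while len(centroids) > 1:
        ids = list(centroids)
        # Exact closest pair of centroids (stand-in for the dynamic ANNS queries).
        d, a, b = min(
            ((np.linalg.norm(centroids[x] - centroids[y]), x, y)
             for i, x in enumerate(ids) for y in ids[i + 1:]),
            key=lambda t: t[0],
        )
        # Weighted update keeps the new centroid equal to the mean of all
        # points in the merged cluster.
        centroids[next_id] = (
            sizes[a] * centroids[a] + sizes[b] * centroids[b]
        ) / (sizes[a] + sizes[b])
        sizes[next_id] = sizes[a] + sizes[b]
        merges.append((a, b, d))
        for c in (a, b):
            del centroids[c], sizes[c]
        next_id += 1
    return merges

# Example: run on a small random point set.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(centroid_hac(rng.normal(size=(10, 4)))[:3])
```

Maintaining the centroid as the weighted mean of the two merged centroids is what makes a nearest-neighbor index over centroids sufficient for selecting merges; the (1+ε)-approximate variant only relaxes which near-minimum pair may be merged at each step.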
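The Open Datasets row describes scoring clusterings against ground truth with ARI and NMI on standard sklearn datasets. The snippet below reproduces that style of evaluation using scipy's exact centroid linkage as a stand-in for the paper's code; the choice of iris and the cut at the ground-truth number of clusters are illustrative assumptions, not the paper's exact protocol.

```python
# Evaluate a centroid-linkage dendrogram cut against ground-truth labels with
# ARI and NMI, in the spirit of the paper's quality evaluation. Uses scipy's
# exact centroid linkage rather than the paper's approximate implementation.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

X, y = load_iris(return_X_y=True)

# 'centroid' linkage in scipy implements exact centroid-linkage HAC.
Z = linkage(X, method="centroid")

# Cut the dendrogram into as many clusters as there are ground-truth labels.
labels = fcluster(Z, t=len(np.unique(y)), criterion="maxclust")

print("ARI:", adjusted_rand_score(y, labels))
print("NMI:", normalized_mutual_info_score(y, labels))
```

Dendrogram purity and the Dasgupta cost, also listed in that row, are properties of the full merge tree `Z` rather than of a single flat cut, so they would be computed directly from the hierarchy.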