Hierarchical Agglomerative Graph Clustering in Nearly-Linear Time

Authors: Laxman Dhulipala, David Eisenstat, Jakub Łącki, Vahab Mirrokni, Jessica Shi

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate the performance of our algorithms on publicly available datasets, and show that our approach can speed up clustering of point datasets by a factor of 20.7–76.5x.
Researcher Affiliation | Collaboration | MIT CSAIL; Google Research.
Pseudocode | Yes | Algorithm 1 MERGE(A, B, L) ... Algorithm 2 GraphHAC-Chain(G = (V, E, w), L) ... Algorithm 3 GraphHAC-Heap(G = (V, E, w), L) ... Algorithm 4 FLIPEDGE(A, B) ... Algorithm 5 UPDATEORIENTATION(A, B, EO)
Open Source Code | Yes | We have made our implementations publicly available on GitHub: https://github.com/ParAlg/gbbs/tree/master/benchmarks/Clustering/SeqHAC
Open Datasets | Yes | We ran this experiment on a collection of large real-world graphs from the SNAP datasets. ... We evaluate our algorithms and the four HAC variants supported by sklearn on the iris, wine, digits, cancer, and faces classification datasets. ... We use the Fashion-MNIST (784 dimensions), Last.fm (65 dimensions), and NYTimes (256 dimensions) datasets in these experiments.
Dataset Splits | No | The paper mentions evaluating on datasets and using 'slices' of datasets, but does not provide specific train/validation/test splits (percentages, counts, or references to predefined splits) for reproducibility.
Hardware Specification | Yes | We run our experiments on a 72-core machine with 4 2.4GHz Intel 18-core E7-8867 v4 Xeon processors, and 1TB of memory.
Software Dependencies | No | The paper mentions using C++, GBBS, PAM, ScaNN, scikit-learn, SciPy, and fastcluster, but does not provide version numbers for these software components.
Experiment Setup | Yes | Our approximate average-linkage algorithm (using ϵ = 0.1)
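The sklearn comparison cited in the Open Datasets row can be sketched in outline. The snippet below is a minimal illustration of one of the four sklearn HAC variants on the iris dataset; the choice of n_clusters=3 and Euclidean distances is an assumption for illustration, not the paper's exact configuration.

```python
# Sketch: average-linkage agglomerative clustering baseline on iris,
# one of the sklearn HAC variants the paper benchmarks against.
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering

# iris: 150 samples, 4 features, 3 classes (hence n_clusters=3 here).
X, y = load_iris(return_X_y=True)
model = AgglomerativeClustering(n_clusters=3, linkage="average")
labels = model.fit_predict(X)
print(len(labels), len(set(labels)))  # 150 3
```

The same pattern applies to the wine, digits, cancer, and faces datasets, swapping in the corresponding sklearn loader and class count.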
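The ϵ = 0.1 in the Experiment Setup row governs an approximation tolerance for average-linkage merges. As a toy illustration only (the acceptance rule below is an assumption about how a (1+ϵ)-approximate HAC admits merges, not the paper's implementation):

```python
# Assumption: in a (1+eps)-approximate HAC, a merge along an edge of
# weight w is admissible whenever w is within a (1+eps) factor of the
# current heaviest edge weight w_max.
def admissible(w: float, w_max: float, eps: float = 0.1) -> bool:
    """Return True if merging along weight w is (1+eps)-approximately best."""
    return (1 + eps) * w >= w_max

print(admissible(0.95, 1.0))  # True: within 10% of the max weight
print(admissible(0.80, 1.0))  # False: more than 10% below the max
```

Larger ϵ admits more merges per round at the cost of a looser approximation; ϵ = 0.1 keeps every accepted merge within 10% of the best available one.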