Hierarchical Agglomerative Graph Clustering in Nearly-Linear Time

Authors: Laxman Dhulipala, David Eisenstat, Jakub Łącki, Vahab Mirrokni, Jessica Shi

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate the performance of our algorithms on publicly available datasets, and show that our approach can speed up clustering of point datasets by a factor of 20.7–76.5x.
Researcher Affiliation | Collaboration | MIT CSAIL; Google Research.
Pseudocode | Yes | Algorithm 1 MERGE(A, B, L) ... Algorithm 2 GraphHAC-Chain(G = (V, E, w), L) ... Algorithm 3 GraphHAC-Heap(G = (V, E, w), L) ... Algorithm 4 FLIPEDGE(A, B) ... Algorithm 5 UPDATEORIENTATION(A, B, EO)
Open Source Code | Yes | We have made our implementations publicly available on GitHub: https://github.com/ParAlg/gbbs/tree/master/benchmarks/Clustering/SeqHAC
Open Datasets | Yes | We ran this experiment on a collection of large real-world graphs from the SNAP datasets. ... We evaluate our algorithms and the four HAC variants supported by sklearn on the iris, wine, digits, cancer, and faces classification datasets. ... We use the Fashion-MNIST (784 dimensions), Last.fm (65 dimensions), and NYTimes (256 dimensions) datasets in these experiments.
Dataset Splits | No | The paper mentions evaluating on datasets and using 'slices' of datasets, but does not provide specific train/validation/test splits (percentages, counts, or references to predefined splits) for reproducibility.
Hardware Specification | Yes | We run our experiments on a 72-core machine with 4 2.4GHz Intel 18-core E7-8867 v4 Xeon processors, and 1TB of memory.
Software Dependencies | No | The paper mentions using C++, GBBS, PAM, ScaNN, scikit-learn, SciPy, and fastcluster, but does not provide version numbers for these software components.
Experiment Setup | Yes | Our approximate average-linkage algorithm (using ϵ = 0.1)
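The sklearn comparison cited in the Open Datasets row can be sketched in outline. The snippet below is a minimal illustration of one of the four sklearn HAC variants on the iris dataset; the choice of n_clusters=3 and Euclidean distances is an assumption for illustration, not the paper's exact configuration.

```python
# Sketch: average-linkage agglomerative clustering baseline on iris,
# one of the sklearn HAC variants the paper benchmarks against.
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering

# iris: 150 samples, 4 features, 3 classes (hence n_clusters=3 here).
X, y = load_iris(return_X_y=True)
model = AgglomerativeClustering(n_clusters=3, linkage="average")
labels = model.fit_predict(X)
print(len(labels), len(set(labels)))  # 150 3
```

The same pattern applies to the wine, digits, cancer, and faces datasets, swapping in the corresponding sklearn loader and class count.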
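The ϵ = 0.1 in the Experiment Setup row governs an approximation tolerance for average-linkage merges. As a toy illustration only (the acceptance rule below is an assumption about how a (1+ϵ)-approximate HAC admits merges, not the paper's implementation):

```python
# Assumption: in a (1+eps)-approximate HAC, a merge along an edge of
# weight w is admissible whenever w is within a (1+eps) factor of the
# current heaviest edge weight w_max.
def admissible(w: float, w_max: float, eps: float = 0.1) -> bool:
    """Return True if merging along weight w is (1+eps)-approximately best."""
    return (1 + eps) * w >= w_max

print(admissible(0.95, 1.0))  # True: within 10% of the max weight
print(admissible(0.80, 1.0))  # False: more than 10% below the max
```

Larger ϵ admits more merges per round at the cost of a looser approximation; ϵ = 0.1 keeps every accepted merge within 10% of the best available one.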