Hierarchical Clustering: $O(1)$-Approximation for Well-Clustered Graphs

Authors: Bogdan-Adrian Manghiuc, He Sun

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To demonstrate the significance of our work, we compare our algorithm against the previous state-of-the-art with a similar approximation guarantee [10] and well-known linkage heuristics on both synthetic and real-world data sets." (Section 5, Experiments:) "We experimentally evaluate the performance of our proposed algorithm, and compare it against the three well-known linkage heuristics for computing hierarchical clustering trees, and different variants of the algorithm proposed in [10], i.e. Linkage++, on both synthetic and real-world data sets."
Researcher Affiliation | Academia | Bogdan-Adrian Manghiuc, School of Informatics, The University of Edinburgh (b.a.manghiuc@sms.ed.ac.uk); He Sun, School of Informatics, The University of Edinburgh (h.sun@ed.ac.uk)
Pseudocode | Yes | Algorithm 1: HCwithDegrees(G{V}) and Algorithm 2: PruneMerge(G, k)
Open Source Code | No | The paper does not provide an explicit statement about, or a link to, open-source code for its methodology.
Open Datasets | Yes | "Real-world data sets. To evaluate the performance of our algorithm on real-world data sets, we follow the sequence of recent work on hierarchical clustering [2, 10, 19, 24], all of which are based on the following 5 data sets from the Scikit-learn library [22] as well as the UCI ML repository [1]: Iris, Wine, Cancer, Boston and Newsgroup." [1] UCI ML Repository, https://archive.ics.uci.edu/ml/index.php. Accessed: 2021-05-22.
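The paper turns each of these feature data sets into a similarity graph via a Gaussian kernel (see the Experiment Setup row). A minimal NumPy sketch of that construction, using a random feature matrix as a stand-in for, e.g., Iris, and the median pairwise distance as the σ-value — an assumption on our part, since the paper defers its exact heuristic to [21]:

```python
import numpy as np

# Stand-in features: 150 points in 4 dimensions (the shape of Iris).
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))

# Pairwise squared Euclidean distances between all points.
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)

# sigma from the median pairwise distance -- a common heuristic, used
# here as an assumption; the paper's exact choice follows [21].
sigma = np.sqrt(np.median(D[D > 0]))

# Gaussian-kernel similarity matrix: entries in (0, 1], ones on the diagonal.
S = np.exp(-D / (2 * sigma**2))
```

The resulting dense matrix S serves as the weighted adjacency matrix of the similarity graph; in practice one often sparsifies it (e.g., by k-nearest neighbours) before clustering.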
Dataset Splits | No | The paper does not explicitly provide details about training, validation, or test dataset splits, such as percentages, counts, or specific predefined splits.
Hardware Specification | Yes | "All algorithms were implemented in Python 3.8 and the experiments were performed using an Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz processor, with 16 GB RAM."
Software Dependencies | Yes | "All algorithms were implemented in Python 3.8"
Experiment Setup | Yes | "We first set the number of clusters as k = 3, and the number of vertices in each cluster $\{P_i\}_{i=1}^{3}$ as 1,000. We assume that any pair of vertices within each cluster is connected by an edge with probability p, and any pair of vertices from different clusters is connected by an edge with probability q. We fix the value q = 0.002, and consider different values of p ∈ [0.04, 0.2]." "for each data set we construct the similarity graph based on the Gaussian kernel, in which the σ-value is chosen according to the standard heuristic [21]."
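The synthetic setup described above is a planted-partition (stochastic block model) graph. A minimal pure-Python sketch of such a sampler under the stated parameters — the function name sbm_graph is our own, not from the paper:

```python
import random

def sbm_graph(sizes, p, q, seed=0):
    """Sample a planted-partition graph: two vertices in the same cluster
    are adjacent with probability p, vertices in different clusters with
    probability q."""
    rng = random.Random(seed)
    # Cluster label of every vertex, e.g. [0]*1000 + [1]*1000 + [2]*1000.
    labels = [c for c, n in enumerate(sizes) for _ in range(n)]
    edges = []
    for u in range(len(labels)):
        for v in range(u + 1, len(labels)):
            if rng.random() < (p if labels[u] == labels[v] else q):
                edges.append((u, v))
    return labels, edges

# The paper's setting: k = 3 clusters of 1,000 vertices each, q = 0.002,
# and p ranging over [0.04, 0.2]; here we take the low end, p = 0.04.
labels, edges = sbm_graph([1000, 1000, 1000], p=0.04, q=0.002)
```

With p = 0.04 this yields roughly 60,000 intra-cluster and 6,000 inter-cluster edges in expectation, so the planted clusters are dense relative to the cut between them — the "well-clustered" regime the paper's guarantee targets.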