Nearly-Optimal Hierarchical Clustering for Well-Clustered Graphs
Authors: Steinar Laenen, Bogdan-Adrian Manghiuc, He Sun
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To demonstrate the significance of our work, we experimentally compare our first algorithm against the previous state-of-the-art and a well-known linkage heuristic (Average Linkage) on both synthetic and real-world datasets. Our experimental results show that the trees constructed from our algorithm and Average Linkage achieve similar cost values, which are much lower than the ones constructed from Cohen-Addad et al. (2017) and Manghiuc & Sun (2021). Moreover, our algorithm runs significantly faster than the other three tested algorithms. |
| Researcher Affiliation | Academia | Steinar Laenen, Bogdan-Adrian Manghiuc, He Sun. School of Informatics, University of Edinburgh, United Kingdom. Correspondence to: Steinar Laenen <steinar.laenen@ed.ac.uk>, Bogdan-Adrian Manghiuc <b.a.manghiuc@sms.ed.ac.uk>, He Sun <h.sun@ed.ac.uk>. |
| Pseudocode | Yes | Algorithm 1 Spectral Clustering with Degree Bucketing and WRSC (Spec WRSC); Algorithm 2 Spectral Clustering with Degree Bucketing and Caterpillar Construction |
| Open Source Code | Yes | Our code can be downloaded from https://github.com/steinarlaenen/nearlyoptimal-hierarchical-clustering-for-well-clustered-graphs. |
| Open Datasets | Yes | We first follow the sequence of recent work on hierarchical clustering (Abboud et al., 2019; Cohen-Addad et al., 2017; Manghiuc & Sun, 2021; Menon et al., 2019; Roy & Pokutta, 2017), all of which are based on the following 4 datasets from the Scikit-learn library (Pedregosa et al., 2011) and the UCI ML repository (Dua & Graff, 2017): Iris (Ir), Wine (Wi), Cancer (Ca), and Newsgroup (Ng). |
| Dataset Splits | No | The paper mentions using synthetic and real-world datasets but does not explicitly describe training, validation, and test splits with specific percentages or counts. It refers to 'synthetic data' and 'real-world datasets' but without detail on how they were partitioned for training or validation. |
| Hardware Specification | Yes | All the tested algorithms were implemented in Python 3.9 and experiments were performed using a Lenovo ThinkPad T15g, with an Intel(R) Xeon(R) W-10855M CPU@2.80GHz processor and 126 GB RAM. |
| Software Dependencies | Yes | All the tested algorithms were implemented in Python 3.9 |
| Experiment Setup | Yes | To compare the running time, we incrementally increase the number of vertices in each cluster {P_i}_{i=1}^5 from n_k = 100 to n_k = 3,000, in increments of 200, and we fix p = 0.1 and q = 0.002. For each algorithm, we measure the running time in seconds, and the results are plotted in Figure 2(a)... To compare the quality of the constructed HC trees on SBMs, we fix the number of vertices inside each cluster {P_i}_{i=1}^5 as n_k = 1,000, the value q = 0.002, and consider different values of p ∈ [0.06, 0.2]. |
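
The synthetic setup above samples graphs from a stochastic block model (SBM) with five clusters, intra-cluster edge probability p and inter-cluster edge probability q. The following is a minimal, illustrative sketch of such a sampler; the function name and structure are our own, not the authors' released code, and the parameter values (k = 5, p = 0.1, q = 0.002) follow the quoted setup.

```python
import random

def sample_sbm(k=5, nk=100, p=0.1, q=0.002, seed=0):
    """Sample an SBM graph: k clusters of nk vertices each,
    intra-cluster edge probability p, inter-cluster probability q.
    Returns the vertex count and an undirected edge list.
    (Illustrative sketch; not the paper's implementation.)"""
    rng = random.Random(seed)
    n = k * nk
    block = [v // nk for v in range(n)]  # cluster label of each vertex
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            prob = p if block[u] == block[v] else q
            if rng.random() < prob:
                edges.append((u, v))
    return n, edges

n, edges = sample_sbm()
# With p much larger than q, the vast majority of edges fall
# inside clusters, which is the well-clustered regime studied.
intra = sum(1 for u, v in edges if u // 100 == v // 100)
print(n, len(edges), intra)
```

In the running-time experiment, nk would be swept from 100 to 3,000 in steps of 200 with p and q fixed; in the quality experiment, nk and q are fixed while p varies over [0.06, 0.2].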