Nearly-Optimal Hierarchical Clustering for Well-Clustered Graphs
Authors: Steinar Laenen, Bogdan-Adrian Manghiuc, He Sun
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To demonstrate the significance of our work, we experimentally compare our first algorithm against the previous state-of-the-art and a well-known linkage heuristic (Average Linkage) on both synthetic and real-world datasets. Our experimental results show that the trees constructed from our algorithm and Average Linkage achieve similar cost values, which are much lower than the ones constructed from Cohen-Addad et al. (2017) and Manghiuc & Sun (2021). Moreover, our algorithm runs significantly faster than the other three tested algorithms. |
| Researcher Affiliation | Academia | Steinar Laenen, Bogdan-Adrian Manghiuc, He Sun. School of Informatics, University of Edinburgh, United Kingdom. Correspondence to: Steinar Laenen <steinar.laenen@ed.ac.uk>, Bogdan-Adrian Manghiuc <b.a.manghiuc@sms.ed.ac.uk>, He Sun <h.sun@ed.ac.uk>. |
| Pseudocode | Yes | Algorithm 1 Spectral Clustering with Degree Bucketing and WRSC (Spec WRSC); Algorithm 2 Spectral Clustering with Degree Bucketing and Caterpillar Construction |
| Open Source Code | Yes | Our code can be downloaded from https://github.com/steinarlaenen/nearlyoptimal-hierarchical-clustering-for-well-clustered-graphs. |
| Open Datasets | Yes | We first follow the sequence of recent work on hierarchical clustering (Abboud et al., 2019; Cohen-Addad et al., 2017; Manghiuc & Sun, 2021; Menon et al., 2019; Roy & Pokutta, 2017), all of which are based on the following 4 datasets from the Scikit-learn library (Pedregosa et al., 2011) and the UCI ML repository (Dua & Graff, 2017): Iris (Ir), Wine (Wi), Cancer (Ca), and Newsgroup (Ng). |
| Dataset Splits | No | The paper mentions using synthetic and real-world datasets but does not explicitly describe training, validation, and test splits with specific percentages or counts. It refers to 'synthetic data' and 'real-world datasets' but without detail on how they were partitioned for training or validation. |
| Hardware Specification | Yes | All the tested algorithms were implemented in Python 3.9 and experiments were performed using a Lenovo ThinkPad T15g, with an Intel(R) Xeon(R) W-10855M CPU@2.80GHz processor and 126 GB RAM. |
| Software Dependencies | Yes | All the tested algorithms were implemented in Python 3.9 |
| Experiment Setup | Yes | To compare the running time, we incrementally increase the number of vertices in each cluster {P_i}_{i=1}^5 from n_k = 100 to n_k = 3,000, in increments of 200, and we fix p = 0.1 and q = 0.002. For each algorithm, we measure the running time in seconds, and the results are plotted in Figure 2(a)... To compare the quality of the constructed HC trees on SBMs, we fix the number of vertices inside each cluster {P_i}_{i=1}^5 as n_k = 1,000, the value q = 0.002, and consider different values of p ∈ [0.06, 0.2]. |
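
The synthetic setup above samples graphs from a stochastic block model (SBM) with five clusters, intra-cluster edge probability p and inter-cluster edge probability q. The following is a minimal, illustrative sketch of such a sampler; the function name and structure are our own, not the authors' released code, and the parameter values (k = 5, p = 0.1, q = 0.002) follow the quoted setup.

```python
import random

def sample_sbm(k=5, nk=100, p=0.1, q=0.002, seed=0):
    """Sample an SBM graph: k clusters of nk vertices each,
    intra-cluster edge probability p, inter-cluster probability q.
    Returns the vertex count and an undirected edge list.
    (Illustrative sketch; not the paper's implementation.)"""
    rng = random.Random(seed)
    n = k * nk
    block = [v // nk for v in range(n)]  # cluster label of each vertex
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            prob = p if block[u] == block[v] else q
            if rng.random() < prob:
                edges.append((u, v))
    return n, edges

n, edges = sample_sbm()
# With p much larger than q, the vast majority of edges fall
# inside clusters, which is the well-clustered regime studied.
intra = sum(1 for u, v in edges if u // 100 == v // 100)
print(n, len(edges), intra)
```

In the running-time experiment, nk would be swept from 100 to 3,000 in steps of 200 with p and q fixed; in the quality experiment, nk and q are fixed while p varies over [0.06, 0.2].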