Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Average Sensitivity of Hierarchical $k$-Median Clustering
Authors: Shijie Li, Weiqiang He, Ruobing Bai, Pan Peng
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose an efficient algorithm for hierarchical k-median clustering and theoretically prove its low average sensitivity and high clustering quality. Additionally, we show that single linkage clustering and a deterministic variant of the CLNSS algorithm exhibit high average sensitivity, making them less stable. Finally, we validate the robustness and effectiveness of our algorithm through experiments. |
| Researcher Affiliation | Academia | 1School of Computer Science and Technology, University of Science and Technology of China, Hefei, China. Correspondence to: Pan Peng <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 (Cohen-Addad et al., 2021) GREEDY ALGORITHM FOR HIERARCHICAL k-MEDIAN ON 2-RHST; Algorithm 2 A LOW-SENSITIVITY ALGORITHM FOR HIERARCHICAL k-MEDIAN CLUSTERING; Algorithm 3 AGGLOMERATIVE ALGORITHM; Algorithm 4 CONSTRUCT2RHST(P); Algorithm 5 THE CLNSS ALGORITHM |
| Open Source Code | No | The paper does not contain an explicit statement about releasing their code, nor does it provide a direct link to a code repository for the methodology described. |
| Open Datasets | Yes | We use datasets from the Scikit-learn library repository (Pedregosa et al., 2011) and the UCI Machine Learning Repository (Asuncion et al., 2007) to evaluate our algorithm. See Table 1 for more detailed description. |
| Dataset Splits | Yes | Given a dataset, we generate perturbed datasets by removing certain points as follows: (1) For each deletion setting (1 point, 1%, 5%, and 10% deletions), conduct 100 independent trials. (2) In each trial, randomly remove the specified number or percentage of points from the original dataset to create a perturbed version. |
| Hardware Specification | Yes | All algorithms were implemented in Python 3.10, and the experiments were conducted on a high-performance computing system featuring a 128-core Intel(R) Xeon(R) Platinum 8358 CPU and 504GB of RAM. |
| Software Dependencies | No | The paper mentions "Python 3.10" and "scikit-learn" but only provides a specific version for Python. It does not list multiple key software components with their versions. |
| Experiment Setup | Yes | In the experiments comparing our algorithm with the CLNSS algorithm and single linkage on these datasets, we set ε to 1 and k to 2, with dataset sizes ranging from 50 to 300 in increments of 50. Specifically, we generated two types of datasets: in Fig. 2a, the data points start from the coordinate origin and are distributed along a random line with progressively increasing distance intervals, while in Fig. 2b, the data points are concentrated at a position in Euclidean space corresponding to the third-layer nodes of the RHST tree, with settings identical to those described in the lemmas. ... For a fixed k, we evaluate our algorithm with three values of ε: 1, 10, and 1000. |
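The perturbation protocol quoted in the Dataset Splits row (1-point, 1%, 5%, and 10% deletions, 100 independent trials each) can be sketched as follows. This is a minimal illustration of the described procedure, not the authors' released code; the function name `perturbed_datasets`, the seeding scheme, and the rounding of percentage deletions are assumptions for the sketch.

```python
import numpy as np

def perturbed_datasets(points, n_trials=100, seed=0):
    """Generate perturbed copies of a dataset by random point deletion,
    following the protocol quoted in the Dataset Splits row:
    for each deletion setting (1 point, 1%, 5%, 10%), run n_trials
    independent trials, each removing that many points uniformly at random.
    """
    rng = np.random.default_rng(seed)
    n = len(points)
    # Deletion settings; at least one point is removed per trial (assumption).
    settings = {
        "1 point": 1,
        "1%": max(1, round(0.01 * n)),
        "5%": max(1, round(0.05 * n)),
        "10%": max(1, round(0.10 * n)),
    }
    out = {}
    for name, k in settings.items():
        trials = []
        for _ in range(n_trials):
            # Indices to keep after removing k random points.
            drop = rng.choice(n, size=k, replace=False)
            keep = np.setdiff1d(np.arange(n), drop)
            trials.append(points[keep])
        out[name] = trials
    return out
```

Average sensitivity would then be estimated by running the clustering algorithm on the original dataset and on each perturbed copy, and averaging a dissimilarity measure between the resulting hierarchies over the trials.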