Efficient Constrained K-center Clustering with Background Knowledge
Authors: Longkun Guo, Chaoqi Jia, Kewen Liao, Zhigang Lu, Minhui Xue
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also construct competitive baseline algorithms and empirically evaluate our approximation algorithm against them on a variety of real datasets. The results validate our theoretical findings and demonstrate the great advantages of our algorithm in terms of clustering cost, clustering quality, and running time. |
| Researcher Affiliation | Academia | (1) School of Mathematics and Statistics, Fuzhou University, Fuzhou 350116, China; (2) Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250316, China; (3) Hilst Lab, Peter Faber Business School, Australian Catholic University, North Sydney 2060, Australia; (4) College of Science and Engineering, James Cook University, Townsville 4810, Australia; (5) CSIRO's Data61, Sydney 2015, Australia |
| Pseudocode | Yes | Algorithm 1: Approximating CL k-center via RDS. |
| Open Source Code | Yes | The complete version of this paper can be found at https://arxiv.org/abs/2401.12533 with datasets and codes at https://github.com/ChaoqiJia/ML-CL-k-Center. |
| Open Datasets | Yes | We follow existing studies on constrained clustering (Wagstaff et al. 2001; Malkomes et al. 2015) to use the four real-world datasets (Wine, Cnae9, Skin and Covertype (Bache and Lichman 2013)) and two famous network traffic datasets (KDDTest+ of NSL-KDD (Tavallaee et al. 2009) and Wide09 of MAWI (Group 2020))... with datasets and codes at https://github.com/ChaoqiJia/ML-CL-k-Center. |
| Dataset Splits | No | We construct both disjoint and intersected ML/CL constraints for the real-world and the simulated datasets in accordance with the Introduction and Algorithm to evaluate the clustering performance of our approximation algorithm against baselines. In short, for a given dataset, a given number of constrained data points, and a given number of participants (who have their own background knowledge of ML/CL), we uniformly sample data points from the raw dataset into different ML and CL sets (a hypothetical sampling sketch appears after this table). |
| Hardware Specification | Yes | We implemented all algorithms in Java 1.8.0 on a 64-bit Linux 3.10 high-performance computer, where each node is equipped with an Intel Xeon Gold 6240 CPU and 32 GB RAM. |
| Software Dependencies | Yes | We implemented all algorithms in Java 1.8.0 on a 64-bit Linux 3.10 high-performance computer. ... LPs can be efficiently solved in polynomial time using widely-used LP solvers like CPLEX (IBM 2022). [Reference: IBM. 2022. v22.1: User's manual for CPLEX.] |
| Experiment Setup | Yes | This section includes a brief description of the experimental configurations. We construct both disjoint and intersected ML/CL constraints for the real-world and the simulated datasets... we uniformly sample data points from the raw dataset into different ML and CL sets. ... We propose two baseline algorithms: a greedy algorithm (Greedy) and a matching-based algorithm (Matching). ... We use the common clustering quality metrics in the experiments, which are Cost, Normalized Mutual Information (NMI) and Rand Index. Runtime is reported on a base-ten logarithmic scale. ... Empirical Approximation Ratio is measured based on the clustering radius ratio obtained on the simulated dataset... obtained from 1,000 runs of the algorithm (10 simulated datasets × 10 distinct constrained cases × 10 repeated runs). Sketches of a greedy baseline, the clustering radius (Cost), and the Rand Index appear after this table. |
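
The constraint-construction procedure quoted under Dataset Splits and Experiment Setup can be illustrated with a short sketch. The Java snippet below (Java being the paper's stated implementation language) is a hypothetical reconstruction under the stated assumptions of uniform sampling and per-participant constraint sets; the class name `ConstraintSampler` and the parameters `numConstrained` and `numParticipants` are illustrative, not taken from the authors' repository.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class ConstraintSampler {
    /**
     * Uniformly sample numConstrained of the n data points and distribute
     * them round-robin into one constraint set per participant; each set
     * can then serve as an ML (must-link) or CL (cannot-link) set.
     */
    public static List<List<Integer>> sampleSets(int n, int numConstrained,
                                                 int numParticipants, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) indices.add(i);
        Collections.shuffle(indices, new Random(seed)); // uniform random order

        List<List<Integer>> sets = new ArrayList<>();
        for (int p = 0; p < numParticipants; p++) sets.add(new ArrayList<>());
        for (int i = 0; i < numConstrained; i++) {
            sets.get(i % numParticipants).add(indices.get(i));
        }
        return sets; // disjoint sets; an intersected variant would reuse points across sets
    }
}
```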
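
For context on the Greedy baseline, the classic farthest-first traversal (Gonzalez's 2-approximation for unconstrained k-center) is sketched below, together with the clustering radius that underlies the Cost metric and the empirical approximation ratio. This is only a reference sketch: the paper's actual Greedy baseline must also respect the ML/CL constraints, and its exact handling of them is not given in the excerpts above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FarthestFirst {
    /** Gonzalez farthest-first traversal: repeatedly pick the point
     *  farthest from the centers chosen so far (2-approximation for k-center). */
    public static List<Integer> kCenter(double[][] points, int k) {
        int n = points.length;
        double[] dist = new double[n];
        Arrays.fill(dist, Double.POSITIVE_INFINITY);
        List<Integer> centers = new ArrayList<>();
        int next = 0; // arbitrary first center
        for (int c = 0; c < k; c++) {
            centers.add(next);
            int farthest = 0;
            for (int i = 0; i < n; i++) { // distance to the nearest chosen center
                dist[i] = Math.min(dist[i], euclidean(points[i], points[next]));
                if (dist[i] > dist[farthest]) farthest = i;
            }
            next = farthest;
        }
        return centers;
    }

    /** k-center cost (radius): max distance from any point to its nearest center. */
    public static double radius(double[][] points, List<Integer> centers) {
        double r = 0;
        for (double[] p : points) {
            double d = Double.POSITIVE_INFINITY;
            for (int c : centers) d = Math.min(d, euclidean(p, points[c]));
            r = Math.max(r, d);
        }
        return r;
    }

    private static double euclidean(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }
}
```

The empirical approximation ratio reported in the table would then be the ratio of such a radius to the known optimal radius of a simulated instance, aggregated over the repeated runs.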
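
Finally, a minimal implementation of the Rand Index quality metric named in the setup, assuming two flat labelings of equal length. This is the standard pairwise definition, not code from the authors' repository.

```java
public class RandIndex {
    /** Fraction of point pairs on which two labelings agree, i.e. pairs placed
     *  in the same cluster by both labelings or in different clusters by both. */
    public static double compute(int[] a, int[] b) {
        long agree = 0, pairs = 0;
        for (int i = 0; i < a.length; i++) {
            for (int j = i + 1; j < a.length; j++) {
                boolean sameA = a[i] == a[j];
                boolean sameB = b[i] == b[j];
                if (sameA == sameB) agree++;
                pairs++;
            }
        }
        return pairs == 0 ? 1.0 : (double) agree / pairs;
    }
}
```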