Efficient Constrained K-center Clustering with Background Knowledge
Authors: Longkun Guo, Chaoqi Jia, Kewen Liao, Zhigang Lu, Minhui Xue
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also construct competitive baseline algorithms and empirically evaluate our approximation algorithm against them on a variety of real datasets. The results validate our theoretical findings and demonstrate the great advantages of our algorithm in terms of clustering cost, clustering quality, and running time. |
| Researcher Affiliation | Academia | (1) School of Mathematics and Statistics, Fuzhou University, Fuzhou 350116, China; (2) Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250316, China; (3) Hilst Lab, Peter Faber Business School, Australian Catholic University, North Sydney 2060, Australia; (4) College of Science and Engineering, James Cook University, Townsville 4810, Australia; (5) CSIRO's Data61, Sydney 2015, Australia |
| Pseudocode | Yes | Algorithm 1: Approximating CL k-center via RDS. |
| Open Source Code | Yes | The complete version of this paper can be found at https://arxiv.org/abs/2401.12533 with datasets and codes at https://github.com/ChaoqiJia/ML-CL-k-Center. |
| Open Datasets | Yes | We follow existing studies on constrained clustering (Wagstaff et al. 2001; Malkomes et al. 2015) to use the four real-world datasets (Wine, Cnae9, Skin and Covertype (Bache and Lichman 2013)) and two famous network traffic datasets (KDDTest+ of NSL-KDD (Tavallaee et al. 2009) and Wide09 of MAWI (Group 2020))... with datasets and codes at https://github.com/ChaoqiJia/ML-CL-k-Center. |
| Dataset Splits | No | We construct both disjoint and intersected ML/CL constraints for the real-world and the simulated datasets in accordance with the Introduction and Algorithm to evaluate the clustering performance of our approximation algorithm against baselines. In short, for a given dataset, a given number of constrained data points, and a given number of participants (who have their own background knowledge of ML/CL), we uniformly sample data points from the raw dataset into different ML and CL sets (a hypothetical sampling sketch appears after this table). |
| Hardware Specification | Yes | We implemented all algorithms in Java 1.8.0 on a 64-bit Linux 3.10 high-performance computer, where each node is equipped with an Intel Xeon Gold 6240 CPU and 32 GB RAM. |
| Software Dependencies | Yes | We implemented all algorithms in Java 1.8.0 on a 64-bit Linux 3.10 high-performance computer. ... LPs can be efficiently solved in polynomial time using widely-used LP solvers like CPLEX (IBM 2022). [Reference: IBM. 2022. v22.1: User's manual for CPLEX.] |
| Experiment Setup | Yes | This section includes a brief description of the experimental configurations. We construct both disjoint and intersected ML/CL constraints for the real-world and the simulated datasets... we uniformly sample data points from the raw dataset into different ML and CL sets. ... We propose two baseline algorithms: a greedy algorithm (Greedy) and a matching-based algorithm (Matching). ... We use the common clustering quality metrics in the experiments, which are Cost, Normalized Mutual Information (NMI) and Rand Index. Runtime is reported on a base-ten logarithmic scale. ... Empirical Approximation Ratio is measured based on the clustering radius ratio obtained on the simulated dataset... obtained from 1,000 runs of the algorithm (10 simulated datasets × 10 distinct constrained cases × 10 repeated runs). Sketches of a greedy baseline, the clustering radius (Cost), and the Rand Index appear after this table. |
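
The constraint-construction procedure quoted under Dataset Splits and Experiment Setup can be illustrated with a short sketch. The Java snippet below (Java being the paper's stated implementation language) is a hypothetical reconstruction under the stated assumptions of uniform sampling and per-participant constraint sets; the class name `ConstraintSampler` and the parameters `numConstrained` and `numParticipants` are illustrative, not taken from the authors' repository.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class ConstraintSampler {
    /**
     * Uniformly sample numConstrained of the n data points and distribute
     * them round-robin into one constraint set per participant; each set
     * can then serve as an ML (must-link) or CL (cannot-link) set.
     */
    public static List<List<Integer>> sampleSets(int n, int numConstrained,
                                                 int numParticipants, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) indices.add(i);
        Collections.shuffle(indices, new Random(seed)); // uniform random order

        List<List<Integer>> sets = new ArrayList<>();
        for (int p = 0; p < numParticipants; p++) sets.add(new ArrayList<>());
        for (int i = 0; i < numConstrained; i++) {
            sets.get(i % numParticipants).add(indices.get(i));
        }
        return sets; // disjoint sets; an intersected variant would reuse points across sets
    }
}
```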
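
For context on the Greedy baseline, the classic farthest-first traversal (Gonzalez's 2-approximation for unconstrained k-center) is sketched below, together with the clustering radius that underlies the Cost metric and the empirical approximation ratio. This is only a reference sketch: the paper's actual Greedy baseline must also respect the ML/CL constraints, and its exact handling of them is not given in the excerpts above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FarthestFirst {
    /** Gonzalez farthest-first traversal: repeatedly pick the point
     *  farthest from the centers chosen so far (2-approximation for k-center). */
    public static List<Integer> kCenter(double[][] points, int k) {
        int n = points.length;
        double[] dist = new double[n];
        Arrays.fill(dist, Double.POSITIVE_INFINITY);
        List<Integer> centers = new ArrayList<>();
        int next = 0; // arbitrary first center
        for (int c = 0; c < k; c++) {
            centers.add(next);
            int farthest = 0;
            for (int i = 0; i < n; i++) { // distance to the nearest chosen center
                dist[i] = Math.min(dist[i], euclidean(points[i], points[next]));
                if (dist[i] > dist[farthest]) farthest = i;
            }
            next = farthest;
        }
        return centers;
    }

    /** k-center cost (radius): max distance from any point to its nearest center. */
    public static double radius(double[][] points, List<Integer> centers) {
        double r = 0;
        for (double[] p : points) {
            double d = Double.POSITIVE_INFINITY;
            for (int c : centers) d = Math.min(d, euclidean(p, points[c]));
            r = Math.max(r, d);
        }
        return r;
    }

    private static double euclidean(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }
}
```

The empirical approximation ratio reported in the table would then be the ratio of such a radius to the known optimal radius of a simulated instance, aggregated over the repeated runs.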
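
Finally, a minimal implementation of the Rand Index quality metric named in the setup, assuming two flat labelings of equal length. This is the standard pairwise definition, not code from the authors' repository.

```java
public class RandIndex {
    /** Fraction of point pairs on which two labelings agree, i.e. pairs placed
     *  in the same cluster by both labelings or in different clusters by both. */
    public static double compute(int[] a, int[] b) {
        long agree = 0, pairs = 0;
        for (int i = 0; i < a.length; i++) {
            for (int j = i + 1; j < a.length; j++) {
                boolean sameA = a[i] == a[j];
                boolean sameB = b[i] == b[j];
                if (sameA == sameB) agree++;
                pairs++;
            }
        }
        return pairs == 0 ? 1.0 : (double) agree / pairs;
    }
}
```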