Near-Linear Time Approximation Algorithms for k-means with Outliers

Authors: Junyu Huang, Qilong Feng, Ziyun Huang, Jinhui Xu, Jianxin Wang

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical experiments suggest that our proposed sampling-based algorithms outperform state-of-the-art algorithms for the k-means with outliers problem. In this section, we demonstrate the performance of our algorithms by comparing with state-of-the-art algorithms for the k-means with outliers problem.
Researcher Affiliation | Academia | (1) School of Computer Science and Engineering, Central South University, Changsha 410083, China; (2) Xiangjiang Laboratory, Changsha 410205, China; (3) Department of Computer Science and Software Engineering, Penn State Erie, The Behrend College; (4) Department of Computer Science and Engineering, State University of New York at Buffalo, NY, USA; (5) The Hunan Provincial Key Lab of Bioinformatics, Central South University, Changsha 410083, China.
Pseudocode | Yes | Algorithm 1 Fast-Sampling(X, k, z, d, η, ϵ). Input: An instance (X, k, z, d) of the k-means with outliers, parameters η and ϵ. Output: A set C ⊆ R^d. ... Algorithm 2 OSE(X, k, z, d, ϵ, R, C). Input: An instance (X, k, z, d) of the k-means with outliers, a parameter ϵ, a probability summation threshold R, and a set C of centers opened. Output: An estimation of the oversampling factor. ... Algorithm 3 Center-Reduction(X, k, z, d, ϵ, η, F). Input: An instance (X, k, z, d) of the k-means with outliers, parameters ϵ and η, a weighted approximation algorithm F with k centers opened and (1 + ϵ)z outliers discarded. Output: A set C ⊆ R^d. (See the sampling sketch after this table.)
Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for the methodology described.
Open Datasets | Yes | We test the performance of different algorithms on real-world datasets including 4 datasets used in (Im et al., 2020; Deshpande et al., 2020) and one large-scale dataset used in (Matsui et al., 2017). The datasets are Skin (245,057 × 3, k = 10), SUSY (5,000,000 × 18, k = 10), Shuttle (43,500 × 9, k = 10), KDDFULL (4,898,431 × 37, k = 3) and SIFT (100,000,000 × 128, k = 100). ... HIGGS-5 (11,000,000 × 28), HIGGS-10 (11,000,000 × 28) ... https://archive.ics.uci.edu/dataset/280/higgs (See the dataset-loading sketch after this table.)
Dataset Splits | No | The paper describes how outliers are generated and mentions using various datasets for evaluation, but it does not specify explicit training/validation/test splits or cross-validation methods for the main experimental datasets.
Hardware Specification | Yes | We perform our experiments on a machine with 100 Intel Xeon Gold 6230 CPUs and 1TB Memory.
Software Dependencies | No | The paper mentions several algorithms and methods (e.g., k-means++, LSDS++, ECOD, IFOREST, Sampling), but it does not provide specific version numbers for any software libraries, frameworks, or programming languages used in the experiments.
Experiment Setup | Yes | In experiments, we fix ϵ, η and δ to be 0.5. The number of sampling iterations is multiplied by a factor of β for β = 1.5. The number of centers opened and outliers discarded for each dataset follows the same settings in (Im et al., 2020; Deshpande et al., 2020). In Appendix C.2, we also test the performance with varying k ranging from 5 to 50 and varying z ranging from 1% to 10%, respectively. For the SIFT dataset, since the data size is much larger than other datasets, moderate changes in k do not significantly influence the clustering cost. Thus, we test the values of k ranging from 50 to 200 on SIFT. In Appendix C.3, we conduct experiments on the performance of our proposed algorithms with varying β. We run all the algorithms on each dataset 10 times and report the best clustering cost and corresponding recall. (See the evaluation sketch after this table.)
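The pseudocode row above names three sampling-based procedures (Fast-Sampling, OSE, Center-Reduction) whose full listings appear in the paper. The sketch below is not the paper's Algorithm 1; it is a minimal, hedged illustration of the general D²-style sampling idea for k-means with outliers that such procedures build on. The parameters `n_rounds` and `oversample`, and the heuristic of ignoring the z currently farthest points when forming the sampling distribution, are assumptions made for illustration only.

```python
import numpy as np

# Hedged sketch: D^2-style (distance-proportional) seeding for k-means with
# outliers. NOT the paper's Fast-Sampling; names and outlier handling are
# illustrative assumptions.
def d2_sample_with_outliers(X, k, z, n_rounds=None, oversample=2, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n_rounds = n_rounds or k
    # Start from one uniformly random point.
    centers = [X[rng.integers(n)]]
    # Running squared distance of every point to its nearest chosen center.
    d2 = ((X - centers[0]) ** 2).sum(axis=1)
    for _ in range(n_rounds - 1):
        # Treat the z farthest points as tentative outliers: exclude them from
        # the sampling distribution so they cannot dominate it (assumption).
        inliers = np.argsort(d2)[: n - z]
        probs = np.zeros(n)
        probs[inliers] = d2[inliers]
        total = probs.sum()
        if total == 0:
            break
        probs /= total
        # Oversample a few candidates per round (an assumed heuristic).
        idx = rng.choice(n, size=oversample, replace=False, p=probs)
        for i in idx:
            centers.append(X[i])
            d2 = np.minimum(d2, ((X - X[i]) ** 2).sum(axis=1))
    return np.asarray(centers)
```

With oversample = 1 and n_rounds = k this reduces to a k-means++-style seeding that simply skips the z farthest points; the paper's actual oversampling factor is estimated by its OSE procedure, which is not reproduced here.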
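The open-datasets row lists public UCI datasets such as Skin, and the dataset-splits row notes that outliers are generated synthetically. As a hedged illustration, the sketch below loads the Skin Segmentation data and plants outliers uniformly in an enlarged bounding box. The file name `Skin_NonSkin.txt`, the fraction `z_frac`, and the outlier-generation scheme are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np
import pandas as pd

# Hypothetical loader for the Skin Segmentation dataset (245,057 x 3 after
# dropping the label column) with synthetic outliers planted as uniform points
# in an enlarged box around the data. File name and outlier scheme are assumed.
def load_skin_with_outliers(path="Skin_NonSkin.txt", z_frac=0.01, scale=2.0, seed=0):
    rng = np.random.default_rng(seed)
    data = pd.read_csv(path, sep=r"\s+", header=None).to_numpy(dtype=float)
    X = data[:, :3]                      # B, G, R features; last column is the class label
    z = int(z_frac * X.shape[0])         # number of planted outliers (assumed fraction)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = hi - lo
    # Sample outliers uniformly in a box 'scale' spans beyond the data range.
    outliers = rng.uniform(lo - scale * span, hi + scale * span, size=(z, X.shape[1]))
    return np.vstack([X, outliers]), z
```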
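The experiment-setup row fixes ϵ = η = δ = 0.5, multiplies the number of sampling iterations by β = 1.5, runs each algorithm 10 times per dataset, and reports the best clustering cost with its corresponding recall. The sketch below mirrors that protocol under assumptions: `run_algorithm` is a hypothetical stand-in for any compared method (returning centers and the indices of discarded points), and recall is taken to mean the fraction of planted outliers among the discarded points.

```python
import numpy as np

# Hedged sketch of the reported evaluation protocol: 10 repetitions, best cost
# reported together with the recall of that run. `run_algorithm` is hypothetical.
PARAMS = {"eps": 0.5, "eta": 0.5, "delta": 0.5, "beta": 1.5, "repeats": 10}

def evaluate(run_algorithm, X, k, z, true_outlier_idx, params=PARAMS):
    best = None
    for rep in range(params["repeats"]):
        centers, discarded_idx = run_algorithm(X, k, z, params, seed=rep)
        # Clustering cost over the kept points (discarded points excluded).
        kept = np.setdiff1d(np.arange(len(X)), discarded_idx)
        d2 = np.min(((X[kept][:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        cost = d2.sum()
        # Recall: fraction of planted outliers that were actually discarded.
        recall = len(np.intersect1d(discarded_idx, true_outlier_idx)) / max(len(true_outlier_idx), 1)
        if best is None or cost < best[0]:
            best = (cost, recall)
    return best  # (best clustering cost over repeats, recall of that run)
```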