Near-Linear Time Approximation Algorithms for k-means with Outliers
Authors: Junyu Huang, Qilong Feng, Ziyun Huang, Jinhui Xu, Jianxin Wang
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical experiments suggest that our proposed sampling-based algorithms outperform state-of-the-art algorithms for the k-means with outliers problem. In this section, we demonstrate the performance of our algorithms by comparing with state-of-the-art algorithms for the k-means with outliers problem. |
| Researcher Affiliation | Academia | 1School of Computer Science and Engineering, Central South University, Changsha 410083, China 2Xiangjiang Laboratory, Changsha 410205, China 3Department of Computer Science and Software Engineering, Penn State Erie, The Behrend College 4Department of Computer Science and Engineering, State University of New York at Buffalo, NY, USA 5The Hunan Provincial Key Lab of Bioinformatics, Central South University, Changsha 410083, China. |
| Pseudocode | Yes | Algorithm 1 Fast-Sampling(X, k, z, d, η, ϵ) Input: An instance (X, k, z, d) of the k-means with outliers, parameters η and ϵ. Output: A set C ⊆ ℝ^d. ... Algorithm 2 OSE(X, k, z, d, ϵ, R, C) Input: An instance (X, k, z, d) of the k-means with outliers, a parameter ϵ, a probability summation threshold R, and a set C of centers opened. Output: An estimation of the oversampling factor. ... Algorithm 3 Center-Reduction(X, k, z, d, ϵ, η, F) Input: An instance (X, k, z, d) of the k-means with outliers, parameters ϵ and η, a weighted approximation algorithm F with k centers opened and (1 + ϵ)z outliers discarded. Output: A set C ⊆ ℝ^d. (A hedged Python sketch of this style of sampling-based seeding appears below the table.) |
| Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for the methodology described. |
| Open Datasets | Yes | We test the performance of different algorithms on real-world datasets including 4 datasets used in (Im et al., 2020; Deshpande et al., 2020) and one large-scale dataset used in (Matsui et al., 2017). The datasets are Skin (245,057 × 3, k = 10), SUSY (5,000,000 × 18, k = 10), Shuttle (43,500 × 9, k = 10), KDDFULL (4,898,431 × 37, k = 3) and SIFT (100,000,000 × 128, k = 100). ... HIGGS-5 (11,000,000 × 28), HIGGS-10 (11,000,000 × 28) ... https://archive.ics.uci.edu/dataset/280/higgs |
| Dataset Splits | No | The paper describes how outliers are generated and mentions using various datasets for evaluation, but it does not specify explicit training/validation/test splits or cross-validation methods for the main experimental datasets. |
| Hardware Specification | Yes | We perform our experiments on a machine with 100 Intel Xeon Gold 6230 CPUs and 1TB Memory. |
| Software Dependencies | No | The paper mentions several algorithms and methods (e.g., k-means++, LSDS++, ECOD, IFOREST, Sampling) and data-generation details (e.g., Gaussian distributions in Appendix C.5), but it does not provide specific version numbers for any software libraries, frameworks, or languages used in the experiments. |
| Experiment Setup | Yes | In experiments, we fix ϵ, η and δ to be 0.5. The number of sampling iterations is multiplied by a factor of β = 1.5. The number of centers opened and outliers discarded for each dataset follows the same settings in (Im et al., 2020; Deshpande et al., 2020). In Appendix C.2, we also test the performance with varying k ranging from 5 to 50 and varying z ranging from 1% to 10%, respectively. For the SIFT dataset, since the data size is much larger than other datasets, moderate changes in k do not significantly influence the clustering cost. Thus, we test the values of k ranging from 50 to 200 on SIFT. In Appendix C.3, we conduct experiments on the performance of our proposed algorithms with varying β. We run all the algorithms on each dataset 10 times and report the best clustering cost and corresponding recall. (A sketch of this best-of-10-runs evaluation loop follows the table.) |
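
To make the pseudocode row more concrete, below is a minimal Python sketch of a generic D²-sampling seeding with oversampling for k-means with z outliers. It illustrates the general family of sampling-based algorithms the paper builds on, not the authors' exact Algorithm 1 (Fast-Sampling); the function name, the `oversample` factor, and the uniform choice of the first center are assumptions made for illustration.

```python
import numpy as np

def d2_sampling_with_outliers(X, k, z, oversample=2, rng=None):
    """Illustrative D^2-style seeding with oversampling for k-means with z outliers.

    Draws roughly oversample * (k + z) candidate centers, each with probability
    proportional to its squared distance to the current center set. This is a
    generic sketch, not the paper's Algorithm 1 (Fast-Sampling).
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]                  # first center chosen uniformly at random
    dist2 = np.sum((X - centers[0]) ** 2, axis=1)   # squared distance to nearest chosen center
    for _ in range(oversample * (k + z) - 1):
        idx = rng.choice(n, p=dist2 / dist2.sum())  # D^2 sampling step
        centers.append(X[idx])
        dist2 = np.minimum(dist2, np.sum((X - X[idx]) ** 2, axis=1))
    return np.asarray(centers)
```

In the paper's pipeline, a candidate set like this would subsequently be reduced back to k centers (the role described for Algorithm 3, Center-Reduction, via a weighted approximation algorithm F); that reduction step is not sketched here.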
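The experiment-setup row states that each algorithm is run 10 times per dataset and the best clustering cost with its corresponding recall is reported. The snippet below sketches that evaluation loop on a synthetic stand-in dataset (Gaussian inliers plus planted uniform outliers); the dataset, k, and z values are illustrative only, and the snippet reuses `d2_sampling_with_outliers` from the sketch above rather than the paper's actual algorithms.

```python
def outlier_recall(true_outliers, flagged_outliers):
    """Fraction of planted outliers that the algorithm actually discards."""
    true_set = set(true_outliers)
    return len(true_set & set(flagged_outliers)) / max(len(true_set), 1)

# Synthetic stand-in for the real datasets: Gaussian inliers plus z planted outliers.
rng = np.random.default_rng(0)
inliers = rng.normal(size=(10_000, 3))
z = 100
X = np.vstack([inliers, rng.uniform(-50, 50, size=(z, 3))])
true_outliers = np.arange(len(inliers), len(X))

best_cost, best_recall = float("inf"), 0.0
for run in range(10):                               # best of 10 runs, per the reported protocol
    centers = d2_sampling_with_outliers(X, k=10, z=z, rng=run)
    d2 = np.min(((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1), axis=1)
    order = np.argsort(d2)
    flagged = order[-z:]                            # treat the z farthest points as outliers
    cost = d2[order[:-z]].sum()                     # clustering cost on the remaining points
    if cost < best_cost:
        best_cost, best_recall = cost, outlier_recall(true_outliers, flagged)
print(f"best cost = {best_cost:.1f}, recall = {best_recall:.2f}")
```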