Linear Time Algorithms for k-means with Multi-Swap Local Search
Authors: Junyu Huang, Qilong Feng, Ziyun Huang, Jinhui Xu, Jianxin Wang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical experiments show that our proposed algorithms achieve better performance compared with the branch-and-bound solver (NeurIPS 2022) and other existing state-of-the-art local search algorithms on both small and large datasets. |
| Researcher Affiliation | Academia | (1) School of Computer Science and Engineering, Central South University, Changsha 410083, China; (2) Xiangjiang Laboratory, Changsha 410205, China; (3) Department of Computer Science and Software Engineering, Penn State Erie, The Behrend College; (4) Department of Computer Science and Engineering, State University of New York at Buffalo, NY, USA; (5) The Hunan Provincial Key Lab of Bioinformatics, Central South University, Changsha 410083, China |
| Pseudocode | Yes | Algorithm 1 k-means++ — Input: An instance (P, k) of the k-means problem. Output: A set C ⊆ ℝ^d of centers with size at most k. ... Algorithm 2 MLS — Input: An instance (P, k) of the k-means problem, parameters T and t. Output: A set C ⊆ ℝ^d of centers with size at most k. ... Algorithm 3 MLSP — Input: An instance (P, k) of the k-means problem, parameters T, t, R, ϵ and η. Output: A set C ⊆ ℝ^d of centers with size at most k. |
| Open Source Code | No | The paper does not include an unambiguous statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | We evaluate the performance of our algorithms on 8 datasets used in [15] with sizes over 50,000, two datasets SUSY (5,000,000 × 17) and HIGGS (11,000,000 × 27) from the UCI Machine Learning Repository3, and one dataset SIFT (100,000,000 × 128) with size 100,000,000 used in [14]. (Footnote 3: https://archive.ics.uci.edu/ml/index.php) |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning. |
| Hardware Specification | Yes | For hardware, all the experiments are conducted on 72 Intel Xeon Gold 6230 CPUs with 500GB memory. |
| Software Dependencies | No | The paper mentions several algorithms and methods but does not provide specific ancillary software details with version numbers (e.g., library or solver names with version numbers). |
| Experiment Setup | Yes | In our experiments, we choose centers from the datasets such that the problem can be solved by the branch-and-bound solver (denoted as BB for short), which can serve as a reference for optimal solutions. We compare our MLSP and MLS algorithms with the BB method and other local search algorithms. Following the settings in [15], the centers obtained by different algorithms are projected to their closest data points in P, and the number of clusters k is set to 3, 5 and 10. Each algorithm is executed 10 times, and the average results with deviation, the best results and the average running time are given. ... For the MLSP and MLS algorithms, we set the sampling rounds as T = 400 with a swap size t = 2. For the MLSP algorithm, the failure upper bound is set to R = 5. For parameters ϵ and η, we fix them as 0.5. |
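The paper's Algorithm 1 is the standard k-means++ seeding routine, which both MLS and MLSP use for initialization. A minimal sketch of that seeding step is below; the function and variable names are illustrative, not taken from the paper.

```python
import random

def kmeanspp_seed(points, k, rng=random.Random(0)):
    """k-means++ seeding sketch: pick the first center uniformly at random,
    then pick each subsequent center with probability proportional to its
    squared distance to the nearest already-chosen center."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # squared distance from each point to its closest chosen center
        d2 = [min(sum((p[i] - c[i]) ** 2 for i in range(len(p)))
                  for c in centers)
              for p in points]
        total = sum(d2)
        if total == 0:  # degenerate case: every point is already a center
            centers.append(rng.choice(points))
            continue
        # sample a point index with probability d2[j] / total
        r = rng.random() * total
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
    return centers
```

Because already-chosen centers have zero squared distance, they carry zero sampling weight, so the seeding naturally spreads the k centers across the data.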
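The reported setup (T = 400 sampling rounds, swap size t = 2) corresponds to a multi-swap local search: in each round, t candidate points are sampled and tried against size-t subsets of the current centers, keeping a swap only if it reduces the k-means cost. The sketch below illustrates that loop under simplifying assumptions: candidate points are sampled uniformly here, standing in for whatever sampling distribution the paper's MLS algorithm actually uses, and all names are hypothetical.

```python
import itertools
import random

def kmeans_cost(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    return sum(min(sum((p[i] - c[i]) ** 2 for i in range(len(p)))
                   for c in centers)
               for p in points)

def multi_swap_ls(points, centers, T=400, t=2, rng=random.Random(0)):
    """Simplified multi-swap local search: for T rounds, sample t candidate
    points (uniformly, as a stand-in for the paper's sampling scheme) and
    try swapping them against every size-t subset of the current centers,
    accepting the first swap that lowers the cost."""
    centers = list(centers)
    best = kmeans_cost(points, centers)
    for _ in range(T):
        candidates = rng.sample(points, t)
        for out in itertools.combinations(range(len(centers)), t):
            trial = [c for j, c in enumerate(centers)
                     if j not in out] + candidates
            trial_cost = kmeans_cost(points, trial)
            if trial_cost < best:
                centers, best = trial, trial_cost
                break  # accept the improving swap, move to the next round
    return centers, best
```

Since swaps are only accepted when they strictly lower the cost, the returned solution never costs more than the initial one; with t = 2 the search can escape local optima that single-swap local search gets stuck in.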