Global Optimal K-Medoids Clustering of One Million Samples
Authors: Jiayang Ren, Kaixun Hua, Yankai Cao
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive computational studies on 28 machine learning datasets demonstrate that our algorithm can provide a provable global optimal solution with an optimality gap of 0.1% within 4 hours on datasets with up to one million samples. |
| Researcher Affiliation | Academia | Jiayang Ren , Kaixun Hua , Yankai Cao University of British Columbia |
| Pseudocode | Yes | Algorithm 1 Tailored Reduced-space Branch and Bound Scheme for K-Medoids clustering, Algorithm 2 Probing, Algorithm 3 Cluster Assignment, Algorithm 4 Feasibility Based Bound Tightening, Algorithm 5 Lagrangian Based Lower Bounding |
| Open Source Code | Yes | The complete code files can be found in https: //github.com/Yankai Group/global_kmedoids_clustering |
| Open Datasets | Yes | To evaluate the performance, we use 25 datasets from the UCI Machine Learning Repository [26], one dataset called PR2392 from [27], one dataset called HEMI[28] and one synthetic dataset from [19]. |
| Dataset Splits | No | The paper evaluates a K-Medoids clustering solver on entire datasets and does not describe explicit training, validation, or test dataset splits for its experiments. |
| Hardware Specification | Yes | We executed all the experiments on a high-performance computing cluster, of which each node contains 40 Intel cores at 2.4 GHz and 202 GB RAM. |
| Software Dependencies | Yes | We implement our branch and bound K-Medoids algorithm with Lagrangian based lower bound method (BB+LD) in Julia 1.6.1. We also implement four other methods to compare the performance, including the extensive form of KMedoids problem solved by the state-of-art global optimizer CPLEX 20.1.0 [14] (CPLEX)... |
| Experiment Setup | Yes | We present the results of cluster number K = 3 in the main body with additional results K = 5 and K = 10 in Appendix C. Both LD and BB+LD use the same sub-gradient method called the Volume method [21]. Particularly, for BB+LD, the maximum iteration of Lagrangian multiplier updates in each node is 10. For LD, the maximum iteration is 10,000. In the heuristic method, we execute the K-Means-like method 20 times with a fixed seed set including 20 different seeds, then feed the solution set as initial guesses to ECA to get the best solutions. All the deterministic methods CPLEX, LD, BB+Basics, and BB+LD, share the same terminal criteria, including (1) Gap 0.1%, (2) the solving time reaches 4 hours, (3) the number of solved nodes reaches 5 million. |