reproducibilityindex.ai

Global Optimal K-Medoids Clustering of One Million Samples

Authors: Jiayang Ren, Kaixun Hua, Yankai Cao

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive computational studies on 28 machine learning datasets demonstrate that our algorithm can provide a provable global optimal solution with an optimality gap of 0.1% within 4 hours on datasets with up to one million samples.
Researcher Affiliation	Academia	Jiayang Ren , Kaixun Hua , Yankai Cao University of British Columbia
Pseudocode	Yes	Algorithm 1 Tailored Reduced-space Branch and Bound Scheme for K-Medoids clustering, Algorithm 2 Probing, Algorithm 3 Cluster Assignment, Algorithm 4 Feasibility Based Bound Tightening, Algorithm 5 Lagrangian Based Lower Bounding
Open Source Code	Yes	The complete code files can be found in https: //github.com/Yankai Group/global_kmedoids_clustering
Open Datasets	Yes	To evaluate the performance, we use 25 datasets from the UCI Machine Learning Repository [26], one dataset called PR2392 from [27], one dataset called HEMI[28] and one synthetic dataset from [19].
Dataset Splits	No	The paper evaluates a K-Medoids clustering solver on entire datasets and does not describe explicit training, validation, or test dataset splits for its experiments.
Hardware Specification	Yes	We executed all the experiments on a high-performance computing cluster, of which each node contains 40 Intel cores at 2.4 GHz and 202 GB RAM.
Software Dependencies	Yes	We implement our branch and bound K-Medoids algorithm with Lagrangian based lower bound method (BB+LD) in Julia 1.6.1. We also implement four other methods to compare the performance, including the extensive form of KMedoids problem solved by the state-of-art global optimizer CPLEX 20.1.0 [14] (CPLEX)...
Experiment Setup	Yes	We present the results of cluster number K = 3 in the main body with additional results K = 5 and K = 10 in Appendix C. Both LD and BB+LD use the same sub-gradient method called the Volume method [21]. Particularly, for BB+LD, the maximum iteration of Lagrangian multiplier updates in each node is 10. For LD, the maximum iteration is 10,000. In the heuristic method, we execute the K-Means-like method 20 times with a fixed seed set including 20 different seeds, then feed the solution set as initial guesses to ECA to get the best solutions. All the deterministic methods CPLEX, LD, BB+Basics, and BB+LD, share the same terminal criteria, including (1) Gap 0.1%, (2) the solving time reaches 4 hours, (3) the number of solved nodes reaches 5 million.