Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..
Global Optimal K-Medoids Clustering of One Million Samples
Authors: Jiayang Ren, Kaixun Hua, Yankai Cao
NeurIPS 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive computational studies on 28 machine learning datasets demonstrate that our algorithm can provide a provable global optimal solution with an optimality gap of 0.1% within 4 hours on datasets with up to one million samples. |
| Researcher Affiliation | Academia | Jiayang Ren , Kaixun Hua , Yankai Cao University of British Columbia |
| Pseudocode | Yes | Algorithm 1 Tailored Reduced-space Branch and Bound Scheme for K-Medoids clustering, Algorithm 2 Probing, Algorithm 3 Cluster Assignment, Algorithm 4 Feasibility Based Bound Tightening, Algorithm 5 Lagrangian Based Lower Bounding |
| Open Source Code | Yes | The complete code files can be found in https: //github.com/Yankai Group/global_kmedoids_clustering |
| Open Datasets | Yes | To evaluate the performance, we use 25 datasets from the UCI Machine Learning Repository [26], one dataset called PR2392 from [27], one dataset called HEMI[28] and one synthetic dataset from [19]. |
| Dataset Splits | No | The paper evaluates a K-Medoids clustering solver on entire datasets and does not describe explicit training, validation, or test dataset splits for its experiments. |
| Hardware Specification | Yes | We executed all the experiments on a high-performance computing cluster, of which each node contains 40 Intel cores at 2.4 GHz and 202 GB RAM. |
| Software Dependencies | Yes | We implement our branch and bound K-Medoids algorithm with Lagrangian based lower bound method (BB+LD) in Julia 1.6.1. We also implement four other methods to compare the performance, including the extensive form of KMedoids problem solved by the state-of-art global optimizer CPLEX 20.1.0 [14] (CPLEX)... |
| Experiment Setup | Yes | We present the results of cluster number K = 3 in the main body with additional results K = 5 and K = 10 in Appendix C. Both LD and BB+LD use the same sub-gradient method called the Volume method [21]. Particularly, for BB+LD, the maximum iteration of Lagrangian multiplier updates in each node is 10. For LD, the maximum iteration is 10,000. In the heuristic method, we execute the K-Means-like method 20 times with a fixed seed set including 20 different seeds, then feed the solution set as initial guesses to ECA to get the best solutions. All the deterministic methods CPLEX, LD, BB+Basics, and BB+LD, share the same terminal criteria, including (1) Gap 0.1%, (2) the solving time reaches 4 hours, (3) the number of solved nodes reaches 5 million. |