K-Medoids For K-Means Seeding

Authors: James Newling, François Fleuret

NeurIPS 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show experimentally that the algorithm clarans of Ng and Han (1994) finds better K-medoids solutions than the Voronoi iteration algorithm of Hastie et al. (2001). This finding, along with the similarity between the Voronoi iteration algorithm and Lloyd's K-means algorithm, motivates us to use clarans as a K-means initializer. We show that clarans outperforms other algorithms on 23/23 datasets with a mean decrease over k-means++ (Arthur and Vassilvitskii, 2007) of 30% for initialization mean squared error (MSE) and 3% for final MSE. (A seeding sketch appears after the table.)
Researcher Affiliation | Academia | James Newling, Idiap Research Institute and École polytechnique fédérale de Lausanne, james.newling@idiap.ch; François Fleuret, Idiap Research Institute and École polytechnique fédérale de Lausanne, francois.fleuret@idiap.ch
Pseudocode | Yes | Algorithm 1: two-step iterative medlloyd algorithm (in a vector space with quadratic potential); Algorithm 2: swap-based clarans algorithm (in a vector space and with quadratic potential). (Minimal sketches of both appear after the table.)
Open Source Code | Yes | Our source code at https://github.com/idiap/zentas is available under an open source license. It consists of a C++ library with Python interface, with several examples for diverse data types (sequence data, sparse and dense vectors), metrics (Levenshtein, l1, etc.) and potentials (quadratic as in K-means, logarithmic, etc.).
Open Datasets | Yes | The 23 datasets. Column TL is the time allocated to run with each initialization scheme, so that no new runs start after TL elapsed seconds. The starred datasets are those used in Bachem et al. (2016); the remainder are available at https://cs.joensuu.fi/sipu/datasets.
Dataset Splits | No | The paper uses common benchmark datasets but does not explicitly state the specific train/validation/test splits used for these datasets in their experiments, nor does it define a cross-validation setup or provide sample counts for splits.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, processor types, memory amounts) used for running experiments are mentioned in the paper.
Software Dependencies | No | The paper mentions C++, Python, and Cython, and refers to specific implementations by other authors (e.g., afk-mc2 of Bachem et al. (2016)), but it does not provide specific version numbers for any programming languages, libraries, or software dependencies required to replicate the experiments.
Experiment Setup | Yes | Rather than run each method a fixed number of times, we therefore run each method as many times as possible in a given time limit, TL. This dataset-dependent time limit, given by column TL in Table 3, is taken as 80× the time of a single run of km++ + lloyd. Recall that our stopping criterion for clarans is K² consecutively rejected swap proposals. We use K² rejections for simplicity, although we have found that fewer than K² are in general needed to obtain minimal MSEs. (A sketch of this run-until-time-limit protocol follows the table.)
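
For reference, the two procedures named in the Pseudocode row are easy to sketch for dense vectors under the quadratic (K-means) potential. The NumPy sketch below is illustrative only and is not the paper's zentas implementation: the function names, the uniform swap-proposal scheme, and the naive full energy recomputation per proposal are assumptions made for clarity, whereas the paper's implementation evaluates swap proposals far more cheaply.

```python
import numpy as np

def quadratic_energy(X, medoid_idx):
    """Sum over samples of squared distance to the nearest medoid."""
    d2 = ((X[:, None, :] - X[medoid_idx][None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def medlloyd(X, medoid_idx, max_iter=100):
    """Algorithm 1 sketch (Voronoi iteration): assign each sample to its
    nearest medoid, then recentre each cluster on its best in-cluster sample."""
    medoid_idx = np.array(medoid_idx)
    for _ in range(max_iter):
        d2 = ((X[:, None, :] - X[medoid_idx][None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        new_idx = medoid_idx.copy()
        for k in range(len(medoid_idx)):
            members = np.flatnonzero(assign == k)
            if members.size == 0:
                continue
            # in-cluster sample minimising the summed squared distance
            within = ((X[members][:, None, :] - X[members][None, :, :]) ** 2).sum(axis=2)
            new_idx[k] = members[within.sum(axis=1).argmin()]
        if np.array_equal(new_idx, medoid_idx):
            break
        medoid_idx = new_idx
    return medoid_idx

def clarans(X, medoid_idx, rng=None):
    """Algorithm 2 sketch (swap-based): propose random medoid/non-medoid
    swaps, keep those that lower the energy, and stop after K^2 consecutive
    rejections (the stopping rule quoted in the Experiment Setup row)."""
    rng = np.random.default_rng() if rng is None else rng
    medoid_idx = np.array(medoid_idx)
    K, N = len(medoid_idx), len(X)
    energy = quadratic_energy(X, medoid_idx)
    rejections = 0
    while rejections < K * K:
        k, j = rng.integers(K), rng.integers(N)
        if j in medoid_idx:
            continue
        proposal = medoid_idx.copy()
        proposal[k] = j
        new_energy = quadratic_energy(X, proposal)
        if new_energy < energy:
            medoid_idx, energy = proposal, new_energy
            rejections = 0
        else:
            rejections += 1
    return medoid_idx
```

Under the quadratic potential, the energy above is N times the MSE that the paper reports.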
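The seeding claim quoted in the Research Type row amounts to handing the clarans medoids to Lloyd's algorithm as initial centres. A hypothetical pipeline built on the sketch above, using scikit-learn's KMeans purely for the Lloyd refinement (an assumption; the paper uses its own zentas library, not scikit-learn):

```python
import numpy as np
from sklearn.cluster import KMeans

def clarans_seeded_kmeans(X, K, seed=0):
    rng = np.random.default_rng(seed)
    # start from K distinct samples chosen uniformly at random
    init_idx = rng.choice(len(X), size=K, replace=False)
    medoid_idx = clarans(X, init_idx, rng=rng)  # clarans from the sketch above
    # hand the medoids to Lloyd's algorithm as initial centres
    return KMeans(n_clusters=K, init=X[medoid_idx], n_init=1).fit(X)

X = np.random.default_rng(0).normal(size=(1000, 2))
model = clarans_seeded_kmeans(X, K=10)
print(model.inertia_)  # sum of squared distances, i.e. MSE * N
```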
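Finally, the Experiment Setup row describes a run-until-time-limit protocol rather than a fixed number of runs per method. A minimal sketch of that protocol is below; the callables initializer, lloyd and mse are hypothetical placeholders, and keeping the best MSE across runs is an assumption, since the quoted text does not say how runs are aggregated.

```python
import time

def run_until_time_limit(X, K, initializer, lloyd, mse, time_limit_s):
    """Start new runs of `initializer` followed by `lloyd` refinement until
    the time limit elapses, tracking the best initialization MSE and the
    best final MSE observed."""
    best_init_mse = float("inf")
    best_final_mse = float("inf")
    start = time.time()
    while time.time() - start < time_limit_s:   # no new run starts after TL
        centres = initializer(X, K)             # e.g. clarans or km++ seeding
        best_init_mse = min(best_init_mse, mse(X, centres))
        best_final_mse = min(best_final_mse, mse(X, lloyd(X, centres)))
    return best_init_mse, best_final_mse
```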