K-Medoids For K-Means Seeding

Authors: James Newling, François Fleuret

NeurIPS 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show experimentally that the algorithm clarans of Ng and Han (1994) finds better K-medoids solutions than the Voronoi iteration algorithm of Hastie et al. (2001). This finding, along with the similarity between the Voronoi iteration algorithm and Lloyd's K-means algorithm, motivates us to use clarans as a K-means initializer. We show that clarans outperforms other algorithms on 23/23 datasets with a mean decrease over k-means++ (Arthur and Vassilvitskii, 2007) of 30% for initialization mean squared error (MSE) and 3% for final MSE. (A seeding sketch appears after the table.)
Researcher Affiliation | Academia | James Newling, Idiap Research Institute and École polytechnique fédérale de Lausanne, james.newling@idiap.ch; François Fleuret, Idiap Research Institute and École polytechnique fédérale de Lausanne, francois.fleuret@idiap.ch
Pseudocode | Yes | Algorithm 1: two-step iterative medlloyd algorithm (in a vector space with quadratic potential); Algorithm 2: swap-based clarans algorithm (in a vector space and with quadratic potential). (Minimal sketches of both appear after the table.)
Open Source Code | Yes | Our source code at https://github.com/idiap/zentas is available under an open source license. It consists of a C++ library with Python interface, with several examples for diverse data types (sequence data, sparse and dense vectors), metrics (Levenshtein, l1, etc.) and potentials (quadratic as in K-means, logarithmic, etc.).
Open Datasets | Yes | The 23 datasets. Column TL is the time allocated to run with each initialization scheme, so that no new runs start after TL elapsed seconds. The starred datasets are those used in Bachem et al. (2016); the remainder are available at https://cs.joensuu.fi/sipu/datasets.
Dataset Splits | No | The paper uses common benchmark datasets but does not explicitly state the specific train/validation/test splits used for these datasets in their experiments, nor does it define a cross-validation setup or provide sample counts for splits.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, processor types, memory amounts) used for running experiments are mentioned in the paper.
Software Dependencies | No | The paper mentions C++, Python, and Cython, and refers to specific implementations by other authors (e.g., afk-mc2 of Bachem et al. (2016)), but it does not provide specific version numbers for any programming languages, libraries, or software dependencies required to replicate the experiments.
Experiment Setup | Yes | Rather than run each method a fixed number of times, we therefore run each method as many times as possible in a given time limit, TL. This dataset-dependent time limit, given by column TL in Table 3, is taken as 80× the time of a single run of km++ + lloyd. Recall that our stopping criterion for clarans is K² consecutively rejected swap proposals. We use K² rejections for simplicity, although we have found that fewer than K² are in general needed to obtain minimal MSEs. (A sketch of this run-until-time-limit protocol follows the table.)
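
For reference, the two procedures named in the Pseudocode row are easy to sketch for dense vectors under the quadratic (K-means) potential. The NumPy sketch below is illustrative only and is not the paper's zentas implementation: the function names, the uniform swap-proposal scheme, and the naive full energy recomputation per proposal are assumptions made for clarity, whereas the paper's implementation evaluates swap proposals far more cheaply.

```python
import numpy as np

def quadratic_energy(X, medoid_idx):
    """Sum over samples of squared distance to the nearest medoid."""
    d2 = ((X[:, None, :] - X[medoid_idx][None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def medlloyd(X, medoid_idx, max_iter=100):
    """Algorithm 1 sketch (Voronoi iteration): assign each sample to its
    nearest medoid, then recentre each cluster on its best in-cluster sample."""
    medoid_idx = np.array(medoid_idx)
    for _ in range(max_iter):
        d2 = ((X[:, None, :] - X[medoid_idx][None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        new_idx = medoid_idx.copy()
        for k in range(len(medoid_idx)):
            members = np.flatnonzero(assign == k)
            if members.size == 0:
                continue
            # in-cluster sample minimising the summed squared distance
            within = ((X[members][:, None, :] - X[members][None, :, :]) ** 2).sum(axis=2)
            new_idx[k] = members[within.sum(axis=1).argmin()]
        if np.array_equal(new_idx, medoid_idx):
            break
        medoid_idx = new_idx
    return medoid_idx

def clarans(X, medoid_idx, rng=None):
    """Algorithm 2 sketch (swap-based): propose random medoid/non-medoid
    swaps, keep those that lower the energy, and stop after K^2 consecutive
    rejections (the stopping rule quoted in the Experiment Setup row)."""
    rng = np.random.default_rng() if rng is None else rng
    medoid_idx = np.array(medoid_idx)
    K, N = len(medoid_idx), len(X)
    energy = quadratic_energy(X, medoid_idx)
    rejections = 0
    while rejections < K * K:
        k, j = rng.integers(K), rng.integers(N)
        if j in medoid_idx:
            continue
        proposal = medoid_idx.copy()
        proposal[k] = j
        new_energy = quadratic_energy(X, proposal)
        if new_energy < energy:
            medoid_idx, energy = proposal, new_energy
            rejections = 0
        else:
            rejections += 1
    return medoid_idx
```

Under the quadratic potential, the energy above is N times the MSE that the paper reports.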
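The seeding claim quoted in the Research Type row amounts to handing the clarans medoids to Lloyd's algorithm as initial centres. A hypothetical pipeline built on the sketch above, using scikit-learn's KMeans purely for the Lloyd refinement (an assumption; the paper uses its own zentas library, not scikit-learn):

```python
import numpy as np
from sklearn.cluster import KMeans

def clarans_seeded_kmeans(X, K, seed=0):
    rng = np.random.default_rng(seed)
    # start from K distinct samples chosen uniformly at random
    init_idx = rng.choice(len(X), size=K, replace=False)
    medoid_idx = clarans(X, init_idx, rng=rng)  # clarans from the sketch above
    # hand the medoids to Lloyd's algorithm as initial centres
    return KMeans(n_clusters=K, init=X[medoid_idx], n_init=1).fit(X)

X = np.random.default_rng(0).normal(size=(1000, 2))
model = clarans_seeded_kmeans(X, K=10)
print(model.inertia_)  # sum of squared distances, i.e. MSE * N
```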
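Finally, the Experiment Setup row describes a run-until-time-limit protocol rather than a fixed number of runs per method. A minimal sketch of that protocol is below; the callables initializer, lloyd and mse are hypothetical placeholders, and keeping the best MSE across runs is an assumption, since the quoted text does not say how runs are aggregated.

```python
import time

def run_until_time_limit(X, K, initializer, lloyd, mse, time_limit_s):
    """Start new runs of `initializer` followed by `lloyd` refinement until
    the time limit elapses, tracking the best initialization MSE and the
    best final MSE observed."""
    best_init_mse = float("inf")
    best_final_mse = float("inf")
    start = time.time()
    while time.time() - start < time_limit_s:   # no new run starts after TL
        centres = initializer(X, K)             # e.g. clarans or km++ seeding
        best_init_mse = min(best_init_mse, mse(X, centres))
        best_final_mse = min(best_final_mse, mse(X, lloyd(X, centres)))
    return best_init_mse, best_final_mse
```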