Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent Speedup

Authors: Yufei Ding, Yue Zhao, Xipeng Shen, Madanlal Musuvathi, Todd Mytkowicz

ICML 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments on a spectrum of problem settings and machines show that Yinyang K-means excels in all the cases, consistently outperforming the classic K-means by an order of magnitude and the fastest prior known K-means algorithms by more than three times on average." (Section 5: Experiments)
Researcher Affiliation | Collaboration | Yufei Ding (YDING8@NCSU.EDU), Yue Zhao (YZHAO30@NCSU.EDU), Xipeng Shen (XSHEN5@NCSU.EDU), Department of Computer Science, North Carolina State University; Madanlal Musuvathi (MADANM@MICROSOFT.COM), Todd Mytkowicz (TODDM@MICROSOFT.COM), Microsoft Research
Pseudocode | Yes | "3. Algorithm. Putting the group filter, local filter, and new center update algorithm together, we get the complete Yinyang K-means as follows. Step 1: Set t to a value no greater than k/10 and meeting the space constraint. Group the initial centers into t groups, {G_i | i = 1, 2, ..., t}, by running K-means on just those initial centers for five iterations to produce reasonable groups while incurring little overhead." (A minimal sketch of this grouping step appears after the table.)
Open Source Code | Yes | "The source code of this work is available at http://research.csc.ncsu.edu/nc-caps/yykmeans.tar.bz2."
Open Datasets | Yes | "We use eight real world large data sets, four of which are taken from the UCI machine learning repository (Bache & Lichman, 2013), while the other four are commonly used image data sets (Wang et al., 2012; 2013)."
Dataset Splits | No | The paper describes the datasets used in its experiments but does not explicitly provide training/test/validation splits, percentages, or cross-validation details needed to reproduce the experiments.
Hardware Specification | Yes | Table 2: "Time and speedup on an Ivybridge machine (16GB memory, 8-core i7-3770K processor)"; Table 3: "Overall speedup over standard K-means on a Core2 machine (4GB mem, 4-core Core2 CPU)"
Software Dependencies | No | The paper mentions software such as GraphLab (Low et al., 2010), OpenCV, and mlpack (Curtin et al., 2013), but does not provide version numbers for these dependencies.
Experiment Setup | Yes | "The parameter t provides a design knob for controlling the space overhead and redundant distance elimination. ... t is set to k/10 if space allows; otherwise, the largest possible value is used. ... Group the initial centers into t groups, {G_i | i = 1, 2, ..., t}, by running K-means on just those initial centers for five iterations to produce reasonable groups while incurring little overhead."
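
The pseudocode and experiment-setup rows above both quote Step 1 of Yinyang K-means: pick t no greater than k/10 (subject to the space budget) and group the k initial centers into t groups by clustering the centers themselves for five K-means iterations. The sketch below illustrates that step under stated assumptions; the function name choose_t_and_group_centers, the optional space_budget_groups parameter, and the use of scikit-learn's KMeans are illustrative choices, not the authors' released implementation.

```python
# Sketch of Yinyang K-means Step 1: group the k initial centers into
# t <= k/10 groups by running K-means on the centers themselves for a
# few iterations. Names and the scikit-learn dependency are assumptions
# made for illustration, not the paper's released code.
import numpy as np
from sklearn.cluster import KMeans


def choose_t_and_group_centers(initial_centers, space_budget_groups=None):
    """Return (t, group_labels) for the k initial centers.

    initial_centers: (k, d) array of the k initial cluster centers.
    space_budget_groups: optional cap on t imposed by the memory budget
        (the paper sets t = k/10 when space allows, otherwise the
        largest value the space constraint permits).
    """
    k = initial_centers.shape[0]
    t = max(1, k // 10)                      # t no greater than k/10
    if space_budget_groups is not None:
        t = min(t, space_budget_groups)      # respect the space constraint

    # Run K-means on just the initial centers for five iterations to get
    # reasonable groups while incurring little overhead.
    grouper = KMeans(n_clusters=t, n_init=1, max_iter=5)
    group_labels = grouper.fit_predict(initial_centers)
    return t, group_labels


# Usage: 1000 random 16-dimensional centers grouped into t = 100 groups.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    centers = rng.normal(size=(1000, 16))
    t, labels = choose_t_and_group_centers(centers)
    print(t, np.bincount(labels, minlength=t))
```

These t groups are what the group filter operates on in later iterations: bounds are maintained per group of centers rather than per center, which is why t trades space overhead against the amount of redundant distance computation that can be eliminated.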