Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent Speedup
Authors: Yufei Ding, Yue Zhao, Xipeng Shen, Madanlal Musuvathi, Todd Mytkowicz
ICML 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on a spectrum of problem settings and machines show that Yinyang K-means excels in all the cases, consistently outperforming the classic K-means by an order of magnitude and the fastest prior known K-means algorithms by more than three times on average. Section 5. Experiments. |
| Researcher Affiliation | Collaboration | Yufei Ding, Yue Zhao, Xipeng Shen (Department of Computer Science, North Carolina State University); Madanlal Musuvathi, Todd Mytkowicz (Microsoft Research) |
| Pseudocode | Yes | 3. Algorithm. Putting the group filter, local filter, and new center update algorithm together, we get the complete Yinyang K-means as follows. Step 1: Set t to a value no greater than k/10 and meeting the space constraint. Group the initial centers into t groups, {G_i \| i = 1, 2, ..., t}, by running K-means on just those initial groups for five iterations to produce reasonable groups while incurring little overhead. |
| Open Source Code | Yes | The source code of this work is available at http://research.csc.ncsu.edu/nc-caps/yykmeans.tar.bz2. |
| Open Datasets | Yes | We use eight real world large data sets, four of which are taken from the UCI machine learning repository (Bache & Lichman, 2013), while the other four are commonly used image data sets (Wang et al., 2012; 2013). |
| Dataset Splits | No | The paper mentions using specific datasets for experiments but does not explicitly provide training/test/validation dataset splits, percentages, or cross-validation details needed to reproduce the experiment. |
| Hardware Specification | Yes | Table 2: 'Time and speedup on an Ivybridge machine (16GB memory, 8-core i7-3770K processor)'; Table 3: 'Overall speedup over standard K-means on a Core2 machine (4GB mem, 4-core Core2 CPU)' |
| Software Dependencies | No | The paper mentions software like 'Graphlab (Low et al., 2010)', 'Open CV (Open CV)', and 'mlpack (Curtin et al., 2013)', but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | The parameter t provides a design knob for controlling the space overhead and redundant distance elimination. ... t is set to k/10 if space allows; o.w., the largest possible value is used. ... Group the initial centers into t groups, {G_i \| i = 1, 2, ..., t}, by running K-means on just those initial groups for five iterations to produce reasonable groups while incurring little overhead. |
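The setup step quoted in the Pseudocode and Experiment Setup rows (grouping the k initial centers into t ≈ k/10 groups by running a few iterations of plain K-means on the centers themselves) can be sketched as follows. This is an illustrative reading of Step 1, not the paper's released implementation; the function name `group_centers` and all defaults are assumptions.

```python
# Hypothetical sketch of Step 1 of Yinyang K-means: partition the k
# initial cluster centers into t groups by running ordinary (Lloyd's)
# K-means on the centers themselves for a handful of iterations.
import numpy as np

def group_centers(centers, t=None, iters=5, seed=0):
    """Group the k centers into t groups; returns (labels, group_centroids)."""
    k, _ = centers.shape
    if t is None:
        t = max(1, k // 10)          # paper: t is set to about k/10
    rng = np.random.default_rng(seed)
    # initialize group centroids from a random subset of the centers
    group_centroids = centers[rng.choice(k, size=t, replace=False)]
    for _ in range(iters):           # paper: five iterations suffice
        # assign each center to its nearest group centroid
        dists = np.linalg.norm(
            centers[:, None, :] - group_centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each group centroid (keep the old one if a group is empty)
        for g in range(t):
            members = centers[labels == g]
            if len(members):
                group_centroids[g] = members.mean(axis=0)
    return labels, group_centroids
```

The resulting group labels are what the group filter operates on in later iterations: a drift bound is maintained per group of centers rather than per center, which is the source of the algorithm's space/pruning trade-off controlled by t.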