Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..
Exact Acceleration of K-Means++ and K-Means||
Authors: Edward Raff
IJCAI 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Now that we have detailed the methods by which we accelerate the K-means++ and K-means algorithms, we will evaluate their effectiveness. The two measures we are concerned with are the following: 1) reducing the total number of distance computations and 2) the total run-time spent. Measuring distance computations gives us an upper-bound on potential effectiveness of our algorithm, and allows us to compare approaches in an implementation and hardware independent manner. Measuring the run-time gives us information about the ultimate goal, which is to reduce the time it takes to obtain K seeds. However, it is sensitive to the hardware in use, the language the approach is implemented in, and the relative skills of program authors. For this work we used the JSAT library [Raff, 2017]. The K-means++ algorithm was provided by this framework, and we implemented the K-means and accelerated versions of both algorithms using JSAT. This way all comparisons with respect to run-time and the K-means++ and algorithms presented are directly comparable. Our implementations have been contributed to the JSAT library for public use. Prior works that have investigated alternatives to K-means++ have generally explored only a few datasets with D<100 features and less than 4 values of K, sometimes testing only one value of K per dataset [Bachem et al., 2016b]. For example, while MNIST is regularly tested in seed selection, it is usually projected down to 50 dimensions first [Hamerly, 2010] due to being difficult to accelerate. Since our goal is to produce accelerated versions of these algorithms that are uniformly better, we attempt to test over a wide selection of reasonable scenarios. In Table 1 we show the 11 datasets we use, with D [3, 780], and n covering four orders of magnitude. We will test K [32, 4096] covering each power of two so that we may understand the behavior as K changes and to make sure we produce an improvement even when K is small. To the best of our knowledge, this is a larger number of datasets, range and values of K, and range and values of D to be tested compared to prior work1. Unless stated otherwise, all experiments were done with a single CPU core from an i Mac with a 3.5 GHz Intel i5 CPU with 64 GB of RAM. The phishing dataset is only tested up to K=2048, because at K=4096 we would be selecting over 1/4 of the dataset as means, at which point the purpose of K-means++ style seeding is being defeated by selecting too large a portion of the corpus. All results are averaged over 5 runs, and took four months to complete in our compute environment. |
| Researcher Affiliation | Collaboration | Edward Raff Booz Allen Hamilton University of Maryland, Baltimore County raff_edward@bah.com, raffEMAIL |
| Pseudocode | Yes | Algorithm 1 K-Means++ Algorithm 2 Our Accelerated K-Means++ Algorithm 3 K-Means Algorithm 4 Nearest Neighbor Search in VP Tree Algorithm 5 Our Accelerated K-Means |
| Open Source Code | Yes | Our implementations have been contributed to the JSAT library for public use. |
| Open Datasets | Yes | In Table 1 we show the 11 datasets we use, with D [3, 780], and n covering four orders of magnitude. We will test K [32, 4096] covering each power of two so that we may understand the behavior as K changes and to make sure we produce an improvement even when K is small. To the best of our knowledge, this is a larger number of datasets, range and values of K, and range and values of D to be tested compared to prior work1. Unless stated otherwise, all experiments were done with a single CPU core from an i Mac with a 3.5 GHz Intel i5 CPU with 64 GB of RAM. The phishing dataset is only tested up to K=2048, because at K=4096 we would be selecting over 1/4 of the dataset as means, at which point the purpose of K-means++ style seeding is being defeated by selecting too large a portion of the corpus. All results are averaged over 5 runs, and took four months to complete in our compute environment. |
| Dataset Splits | No | The paper describes the datasets used and the range of K values tested, but it does not specify any training, validation, or test splits for the data. It mentions 'All results are averaged over 5 runs' but this does not describe data splitting. |
| Hardware Specification | Yes | Unless stated otherwise, all experiments were done with a single CPU core from an i Mac with a 3.5 GHz Intel i5 CPU with 64 GB of RAM. |
| Software Dependencies | No | For this work we used the JSAT library [Raff, 2017]. The K-means++ algorithm was provided by this framework, and we implemented the K-means and accelerated versions of both algorithms using JSAT. |
| Experiment Setup | No | The paper specifies the datasets used and the range of K values tested. It also mentions that 'All results are averaged over 5 runs'. However, it does not provide specific hyperparameters (e.g., learning rate, batch size, epochs) or other detailed system-level training settings for the algorithms beyond the hardware used. |