Exact Acceleration of K-Means++ and K-Means||
Authors: Edward Raff
IJCAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Now that we have detailed the methods by which we accelerate the K-means++ and K-means|| algorithms, we will evaluate their effectiveness. The two measures we are concerned with are the following: 1) reducing the total number of distance computations and 2) the total run-time spent. Measuring distance computations gives us an upper-bound on potential effectiveness of our algorithm, and allows us to compare approaches in an implementation and hardware independent manner. Measuring the run-time gives us information about the ultimate goal, which is to reduce the time it takes to obtain K seeds. However, it is sensitive to the hardware in use, the language the approach is implemented in, and the relative skills of program authors. For this work we used the JSAT library [Raff, 2017]. The K-means++ algorithm was provided by this framework, and we implemented the K-means|| and accelerated versions of both algorithms using JSAT. This way all comparisons with respect to run-time and the K-means++ and K-means|| algorithms presented are directly comparable. Our implementations have been contributed to the JSAT library for public use. Prior works that have investigated alternatives to K-means++ have generally explored only a few datasets with D < 100 features and less than 4 values of K, sometimes testing only one value of K per dataset [Bachem et al., 2016b]. For example, while MNIST is regularly tested in seed selection, it is usually projected down to 50 dimensions first [Hamerly, 2010] due to being difficult to accelerate. Since our goal is to produce accelerated versions of these algorithms that are uniformly better, we attempt to test over a wide selection of reasonable scenarios. In Table 1 we show the 11 datasets we use, with D ∈ [3, 780], and n covering four orders of magnitude. We will test K ∈ [32, 4096] covering each power of two so that we may understand the behavior as K changes and to make sure we produce an improvement even when K is small. To the best of our knowledge, this is a larger number of datasets, range and values of K, and range and values of D to be tested compared to prior work. Unless stated otherwise, all experiments were done with a single CPU core from an iMac with a 3.5 GHz Intel i5 CPU with 64 GB of RAM. The phishing dataset is only tested up to K=2048, because at K=4096 we would be selecting over 1/4 of the dataset as means, at which point the purpose of K-means++ style seeding is being defeated by selecting too large a portion of the corpus. All results are averaged over 5 runs, and took four months to complete in our compute environment. |
| Researcher Affiliation | Collaboration | Edward Raff; Booz Allen Hamilton; University of Maryland, Baltimore County; raff_edward@bah.com, raff.edward@umbc.edu |
| Pseudocode | Yes | Algorithm 1: K-Means++; Algorithm 2: Our Accelerated K-Means++; Algorithm 3: K-Means||; Algorithm 4: Nearest Neighbor Search in VP Tree; Algorithm 5: Our Accelerated K-Means|| |
| Open Source Code | Yes | Our implementations have been contributed to the JSAT library for public use. |
| Open Datasets | Yes | In Table 1 we show the 11 datasets we use, with D ∈ [3, 780], and n covering four orders of magnitude. We will test K ∈ [32, 4096] covering each power of two so that we may understand the behavior as K changes and to make sure we produce an improvement even when K is small. To the best of our knowledge, this is a larger number of datasets, range and values of K, and range and values of D to be tested compared to prior work. Unless stated otherwise, all experiments were done with a single CPU core from an iMac with a 3.5 GHz Intel i5 CPU with 64 GB of RAM. The phishing dataset is only tested up to K=2048, because at K=4096 we would be selecting over 1/4 of the dataset as means, at which point the purpose of K-means++ style seeding is being defeated by selecting too large a portion of the corpus. All results are averaged over 5 runs, and took four months to complete in our compute environment. |
| Dataset Splits | No | The paper describes the datasets used and the range of K values tested, but it does not specify any training, validation, or test splits for the data. It mentions 'All results are averaged over 5 runs' but this does not describe data splitting. |
| Hardware Specification | Yes | Unless stated otherwise, all experiments were done with a single CPU core from an iMac with a 3.5 GHz Intel i5 CPU with 64 GB of RAM. |
| Software Dependencies | No | For this work we used the JSAT library [Raff, 2017]. The K-means++ algorithm was provided by this framework, and we implemented the K-means|| and accelerated versions of both algorithms using JSAT. While the JSAT library is named, no version numbers or other ancillary software dependencies are specified. |
| Experiment Setup | No | The paper specifies the datasets used and the range of K values tested. It also mentions that 'All results are averaged over 5 runs'. However, it does not provide specific hyperparameters (e.g., learning rate, batch size, epochs) or other detailed system-level training settings for the algorithms beyond the hardware used. |
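
For context on what the paper accelerates, below is a minimal, self-contained sketch of the standard D²-sampling loop of K-means++ seeding (the baseline corresponding to Algorithm 1 in the Pseudocode row above), not the authors' accelerated variant. It is written in plain Java rather than against the JSAT API, and the class and method names are illustrative assumptions. The inner update loop is where the distance computations the paper counts accumulate: one distance per point per selected seed, or O(nK) distance evaluations in total, which is the cost the accelerated versions aim to reduce.

```java
import java.util.Arrays;
import java.util.Random;

/** Minimal sketch of standard (unaccelerated) K-means++ seeding with D^2 sampling. */
public class KppSeedSketch {

    /** Squared Euclidean distance between two points. */
    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return s;
    }

    /** Select k seed indices from data, sampling each new seed proportional to D^2. */
    static int[] kppSeeds(double[][] data, int k, Random rng) {
        int n = data.length;
        int[] seeds = new int[k];
        double[] closest = new double[n];   // squared distance to the nearest chosen seed
        seeds[0] = rng.nextInt(n);          // first seed is chosen uniformly at random
        for (int i = 0; i < n; i++)
            closest[i] = dist2(data[i], data[seeds[0]]);

        for (int c = 1; c < k; c++) {
            // Sample the next seed with probability proportional to its squared distance.
            double total = 0;
            for (double d : closest) total += d;
            double r = rng.nextDouble() * total, acc = 0;
            int next = n - 1;
            for (int i = 0; i < n; i++) {
                acc += closest[i];
                if (acc >= r) { next = i; break; }
            }
            seeds[c] = next;
            // Update every point's distance to its nearest seed: n distance
            // computations per seed, the cost the paper's accelerated K-means++ avoids.
            for (int i = 0; i < n; i++)
                closest[i] = Math.min(closest[i], dist2(data[i], data[next]));
        }
        return seeds;
    }

    public static void main(String[] args) {
        double[][] toy = { {0, 0}, {0, 1}, {10, 10}, {10, 11}, {5, 5} };
        System.out.println(Arrays.toString(kppSeeds(toy, 2, new Random(42))));
    }
}
```
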