Distributed and Provably Good Seedings for k-Means in Constant Rounds

Authors: Olivier Bachem, Mario Lucic, Andreas Krause

ICML 2017

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | In this paper, we provide a novel analysis of the k-means|| algorithm that bounds the expected solution quality for any number of rounds and oversampling factors greater than k, the two parameters one needs to choose in practice. In particular, we show that k-means|| provides provably good clusterings even for a small, constant number of rounds. This theoretical finding explains the common observation that k-means|| performs extremely well in practice even if the number of rounds is low. We further provide a hard instance that shows that an additive error term as encountered in our analysis is inevitable if less than k − 1 rounds are employed.
Researcher Affiliation | Academia | Department of Computer Science, ETH Zurich.
Pseudocode | Yes | Algorithm 1: k-means++ seeding; Algorithm 2: k-means|| overseeding; Algorithm 3: k-means|| seeding (a single-machine sketch of these routines follows the table).
Open Source Code | No | The paper mentions a third-party library: "A popular choice is the MLLib library of Apache Spark (Meng et al., 2016) which uses k-means|| by default." However, it does not state that the authors' own code for the described methodology is open source or provide a link to it. (A brief MLlib usage sketch also follows the table.)
Open Datasets | No | The paper is theoretical and does not conduct empirical experiments on any dataset; hence no dataset availability information is provided.
Dataset Splits | No | The paper is theoretical and does not include empirical experiments; therefore no dataset splits for training, validation, or testing are mentioned.
Hardware Specification | No | The paper focuses on theoretical analysis and does not describe any computational experiments that would require hardware specifications.
Software Dependencies | No | The paper focuses on theoretical analysis and does not list any specific software dependencies with version numbers required for reproducibility of computational experiments.
Experiment Setup | No | The paper is theoretical and does not describe an experimental setup with specific hyperparameters or training settings.
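
For reference, the following is a minimal, single-machine NumPy sketch of the seeding scheme the paper analyses: a small, constant number of k-means|| overseeding rounds (each point joins the candidate set independently with probability proportional to its squared distance to the current candidates, scaled by the oversampling factor), followed by weighted k-means++ on the oversampled candidate set to reduce it to k centers. This is not the authors' code; the function names and the uniform choice of the first candidate are illustrative assumptions based on the standard descriptions of k-means++ and k-means||.

```python
import numpy as np

def min_dist_sq(X, centers):
    """Squared Euclidean distance from each point in X to its closest center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1)

def kmeanspp_seeding(X, k, weights=None, rng=None):
    """Weighted k-means++ (D^2) seeding, in the spirit of Algorithm 1."""
    rng = np.random.default_rng(rng)
    w = np.ones(len(X)) if weights is None else np.asarray(weights, dtype=float)
    centers = [X[rng.choice(len(X), p=w / w.sum())]]
    for _ in range(k - 1):
        p = w * min_dist_sq(X, np.asarray(centers))
        if p.sum() == 0:          # all remaining mass already covered
            p = w
        centers.append(X[rng.choice(len(X), p=p / p.sum())])
    return np.asarray(centers)

def kmeans_parallel_seeding(X, k, rounds, oversampling, rng=None):
    """Sketch of k-means|| seeding (Algorithms 2 and 3): overseed for a
    constant number of rounds, then reduce the candidate set to k centers
    with weighted k-means++."""
    rng = np.random.default_rng(rng)
    # Overseeding: start from one uniformly sampled point.
    B = X[rng.choice(len(X))][None, :]
    for _ in range(rounds):
        d2 = min_dist_sq(X, B)
        cost = d2.sum()
        if cost == 0:
            break
        # Each point joins B independently with prob. min(1, l * d^2(x, B) / cost).
        keep = rng.random(len(X)) < np.minimum(1.0, oversampling * d2 / cost)
        if keep.any():
            B = np.vstack([B, X[keep]])
    # Weight each candidate by the number of points closest to it,
    # then run weighted k-means++ on the (small) candidate set.
    assign = ((X[:, None, :] - B[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    weights = np.bincount(assign, minlength=len(B)).astype(float)
    return kmeanspp_seeding(B, k, weights=weights, rng=rng)
```

Per the analysis summarized above, the rounds parameter can be kept at a small constant provided the oversampling factor exceeds k.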
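Since the Open Source Code row quotes the paper's pointer to Apache Spark's MLlib, a short PySpark sketch of that library's default seeding is given below. The toy data, application name, and parameter values are illustrative assumptions, not drawn from the paper; initMode="k-means||" selects the seeding analysed above, and initSteps sets the number of overseeding rounds.

```python
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("kmeans-parallel-demo").getOrCreate()

# Toy data; in practice this would be a large, distributed DataFrame.
df = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
     (Vectors.dense([8.0, 9.0]),), (Vectors.dense([9.0, 8.0]),)],
    ["features"],
)

# k-means|| is MLlib's default initialization mode; initSteps is the number
# of overseeding rounds, the quantity whose effect the paper analyses.
kmeans = KMeans(k=2, initMode="k-means||", initSteps=2, seed=1)
model = kmeans.fit(df)
print(model.clusterCenters())

spark.stop()
```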