Distributed and Provably Good Seedings for k-Means in Constant Rounds

Authors: Olivier Bachem, Mario Lucic, Andreas Krause

ICML 2017

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | In this paper, we provide a novel analysis of the k-means|| algorithm that bounds the expected solution quality for any number of rounds and oversampling factors greater than k, the two parameters one needs to choose in practice. In particular, we show that k-means|| provides provably good clusterings even for a small, constant number of rounds. This theoretical finding explains the common observation that k-means|| performs extremely well in practice even if the number of rounds is low. We further provide a hard instance that shows that an additive error term as encountered in our analysis is inevitable if less than k − 1 rounds are employed.
Researcher Affiliation | Academia | Department of Computer Science, ETH Zurich.
Pseudocode | Yes | Algorithm 1: k-means++ seeding; Algorithm 2: k-means|| overseeding; Algorithm 3: k-means|| seeding (a single-machine sketch of these routines follows the table).
Open Source Code | No | The paper mentions a third-party library: "A popular choice is the MLLib library of Apache Spark (Meng et al., 2016) which uses k-means|| by default." However, it does not state that the authors' own code for the described methodology is open source or provide a link to it. (A brief MLlib usage sketch also follows the table.)
Open Datasets | No | The paper is theoretical and does not conduct empirical experiments on any dataset; hence no dataset availability information is provided.
Dataset Splits | No | The paper is theoretical and does not include empirical experiments; therefore no dataset splits for training, validation, or testing are mentioned.
Hardware Specification | No | The paper focuses on theoretical analysis and does not describe any computational experiments that would require hardware specifications.
Software Dependencies | No | The paper focuses on theoretical analysis and does not list any specific software dependencies with version numbers required for reproducibility of computational experiments.
Experiment Setup | No | The paper is theoretical and does not describe an experimental setup with specific hyperparameters or training settings.
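
For reference, the following is a minimal, single-machine NumPy sketch of the seeding scheme the paper analyses: a small, constant number of k-means|| overseeding rounds (each point joins the candidate set independently with probability proportional to its squared distance to the current candidates, scaled by the oversampling factor), followed by weighted k-means++ on the oversampled candidate set to reduce it to k centers. This is not the authors' code; the function names and the uniform choice of the first candidate are illustrative assumptions based on the standard descriptions of k-means++ and k-means||.

```python
import numpy as np

def min_dist_sq(X, centers):
    """Squared Euclidean distance from each point in X to its closest center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1)

def kmeanspp_seeding(X, k, weights=None, rng=None):
    """Weighted k-means++ (D^2) seeding, in the spirit of Algorithm 1."""
    rng = np.random.default_rng(rng)
    w = np.ones(len(X)) if weights is None else np.asarray(weights, dtype=float)
    centers = [X[rng.choice(len(X), p=w / w.sum())]]
    for _ in range(k - 1):
        p = w * min_dist_sq(X, np.asarray(centers))
        if p.sum() == 0:          # all remaining mass already covered
            p = w
        centers.append(X[rng.choice(len(X), p=p / p.sum())])
    return np.asarray(centers)

def kmeans_parallel_seeding(X, k, rounds, oversampling, rng=None):
    """Sketch of k-means|| seeding (Algorithms 2 and 3): overseed for a
    constant number of rounds, then reduce the candidate set to k centers
    with weighted k-means++."""
    rng = np.random.default_rng(rng)
    # Overseeding: start from one uniformly sampled point.
    B = X[rng.choice(len(X))][None, :]
    for _ in range(rounds):
        d2 = min_dist_sq(X, B)
        cost = d2.sum()
        if cost == 0:
            break
        # Each point joins B independently with prob. min(1, l * d^2(x, B) / cost).
        keep = rng.random(len(X)) < np.minimum(1.0, oversampling * d2 / cost)
        if keep.any():
            B = np.vstack([B, X[keep]])
    # Weight each candidate by the number of points closest to it,
    # then run weighted k-means++ on the (small) candidate set.
    assign = ((X[:, None, :] - B[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    weights = np.bincount(assign, minlength=len(B)).astype(float)
    return kmeanspp_seeding(B, k, weights=weights, rng=rng)
```

Per the analysis summarized above, the rounds parameter can be kept at a small constant provided the oversampling factor exceeds k.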
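Since the Open Source Code row quotes the paper's pointer to Apache Spark's MLlib, a short PySpark sketch of that library's default seeding is given below. The toy data, application name, and parameter values are illustrative assumptions, not drawn from the paper; initMode="k-means||" selects the seeding analysed above, and initSteps sets the number of overseeding rounds.

```python
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("kmeans-parallel-demo").getOrCreate()

# Toy data; in practice this would be a large, distributed DataFrame.
df = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
     (Vectors.dense([8.0, 9.0]),), (Vectors.dense([9.0, 8.0]),)],
    ["features"],
)

# k-means|| is MLlib's default initialization mode; initSteps is the number
# of overseeding rounds, the quantity whose effect the paper analyses.
kmeans = KMeans(k=2, initMode="k-means||", initSteps=2, seed=1)
model = kmeans.fit(df)
print(model.clusterCenters())

spark.stop()
```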