Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..

A Beyond-Worst-Case Analysis of Greedy k-means++

Authors: Qingyun Chen, Sungjin Im, Ben Moseley, Ryan Milstrey, Chenyang Xu, Ruilong Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments. Since it is well known that the greedy algorithm outperforms k-means++ in practice and empirical studies, we rather focus on our experiments on tracking how the algorithm makes choices over iterations towards a better seeding. We create synthetic data sets using various distributions and study how the algorithm makes progress in terms of covering new clusters over iterations, not only tracking how the objective changes. The experiments show that the greedy algorithm outperforms k-means++ in both decreasing the objective and covering new clusters.
Researcher Affiliation	Academia	Qingyun Chen UC Santa Cruz EMAIL Sungjin Im UC Santa Cruz EMAIL Ryan Milstrey UC Merced EMAIL Benjamin Moseley Carnegie Mellon University EMAIL Chenyang Xu East China Normal University EMAIL Ruilong Zhang Technical University of Munich EMAIL
Pseudocode	Yes	Algorithm 1 Greedy k-means++ Initialization Arthur et al. [2007] Input: A point set X Rm and parameters k > 0, ℓ= ln k3. Output: A center set S that serves as the initial centers of Lloyd s heuristic. 1: Independently and uniformly sample ℓpoints x1, . . . , xℓ X. 2: Greedily pick x := arg minxi {x1,...,xℓ} φ(X, {xi}) and set S {x}. 3: for t = 2, . . . , k do 4: Sample ℓpoints x1, . . . , xℓ X independently (with replacement) with probability φ(x,S) 5: Greedily pick x := arg minxi {x1,...,xℓ} φ(X, S {xi}), breaking ties arbitrarily. 6: Set S S {x}. 7: end for 8: return S.
Open Source Code	Yes	The submission includes the source code of the experiment in supplemental material.
Open Datasets	No	Input Data. We conduct experiments on synthetic datasets. To generate a dataset, we first fix a distribution. In the experiment, we use the exponential, half-normal (the absolute variant of the Gaussian), and Lomax (heavy-tail sub-exponential) distributions for different datasets. We first sample k centers in Rk uniformly at random from a unit hypercube. Then we sample the radius for each cluster uniformly at random from (0, 2). The number of points of each cluster is uniformly sampled from [64, 256]. To generate the points of a cluster, we choose uniform random points whose distance to the center follows the fixed distribution. Since the centers are sampled from a unit hypercube, it is well-separable because the distances between every two centers are k. The generation of the points guarantees that it is well-spread.
Dataset Splits	No	Input Data. We conduct experiments on synthetic datasets. To generate a dataset, we first fix a distribution. In the experiment, we use the exponential, half-normal (the absolute variant of the Gaussian), and Lomax (heavy-tail sub-exponential) distributions for different datasets. We first sample k centers in Rk uniformly at random from a unit hypercube. Then we sample the radius for each cluster uniformly at random from (0, 2). The number of points of each cluster is uniformly sampled from [64, 256]. To generate the points of a cluster, we choose uniform random points whose distance to the center follows the fixed distribution. Since the centers are sampled from a unit hypercube, it is well-separable because the distances between every two centers are k. The generation of the points guarantees that it is well-spread. The paper describes how the synthetic data is generated but does not specify any training/validation/test splits.
Hardware Specification	Yes	Our experiments are conducted on a machine with Processor 11th Gen Intel(R) Core(TM) i5-1135G7 2.40GHz, 1382 Mhz, 4 Core(s), 8 Logical Processor(s) and 12 GB RAM.
Software Dependencies	No	The paper mentions the Scikit-learn library in the context of common parameter settings but does not provide specific version numbers for any software used in their own experimental implementation.
Experiment Setup	Yes	For greedy k-means++, we sample ln k + 2 candidates in every iteration. The experiment involves 100 repetitions. We compare the average objective and coverage probability in every iteration. The average objective in an iteration is the average k-means objective using the currently chosen centers in 100 repetitions. Similarly, the average coverage probability is the probability that a new center is covered in 100 repetitions. We show the results over iterations for k = 16 in Figure 1 (refer to Appendix D for different k).