Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..
A Beyond-Worst-Case Analysis of Greedy k-means++
Authors: Qingyun Chen, Sungjin Im, Ben Moseley, Ryan Milstrey, Chenyang Xu, Ruilong Zhang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments. Since it is well known that the greedy algorithm outperforms k-means++ in practice and empirical studies, we rather focus on our experiments on tracking how the algorithm makes choices over iterations towards a better seeding. We create synthetic data sets using various distributions and study how the algorithm makes progress in terms of covering new clusters over iterations, not only tracking how the objective changes. The experiments show that the greedy algorithm outperforms k-means++ in both decreasing the objective and covering new clusters. |
| Researcher Affiliation | Academia | Qingyun Chen UC Santa Cruz EMAIL Sungjin Im UC Santa Cruz EMAIL Ryan Milstrey UC Merced EMAIL Benjamin Moseley Carnegie Mellon University EMAIL Chenyang Xu East China Normal University EMAIL Ruilong Zhang Technical University of Munich EMAIL |
| Pseudocode | Yes | Algorithm 1 Greedy k-means++ Initialization Arthur et al. [2007] Input: A point set X Rm and parameters k > 0, โ= ln k3. Output: A center set S that serves as the initial centers of Lloyd s heuristic. 1: Independently and uniformly sample โpoints x1, . . . , xโ X. 2: Greedily pick x := arg minxi {x1,...,xโ} ฯ(X, {xi}) and set S {x}. 3: for t = 2, . . . , k do 4: Sample โpoints x1, . . . , xโ X independently (with replacement) with probability ฯ(x,S) 5: Greedily pick x := arg minxi {x1,...,xโ} ฯ(X, S {xi}), breaking ties arbitrarily. 6: Set S S {x}. 7: end for 8: return S. |
| Open Source Code | Yes | The submission includes the source code of the experiment in supplemental material. |
| Open Datasets | No | Input Data. We conduct experiments on synthetic datasets. To generate a dataset, we first fix a distribution. In the experiment, we use the exponential, half-normal (the absolute variant of the Gaussian), and Lomax (heavy-tail sub-exponential) distributions for different datasets. We first sample k centers in Rk uniformly at random from a unit hypercube. Then we sample the radius for each cluster uniformly at random from (0, 2). The number of points of each cluster is uniformly sampled from [64, 256]. To generate the points of a cluster, we choose uniform random points whose distance to the center follows the fixed distribution. Since the centers are sampled from a unit hypercube, it is well-separable because the distances between every two centers are k. The generation of the points guarantees that it is well-spread. |
| Dataset Splits | No | Input Data. We conduct experiments on synthetic datasets. To generate a dataset, we first fix a distribution. In the experiment, we use the exponential, half-normal (the absolute variant of the Gaussian), and Lomax (heavy-tail sub-exponential) distributions for different datasets. We first sample k centers in Rk uniformly at random from a unit hypercube. Then we sample the radius for each cluster uniformly at random from (0, 2). The number of points of each cluster is uniformly sampled from [64, 256]. To generate the points of a cluster, we choose uniform random points whose distance to the center follows the fixed distribution. Since the centers are sampled from a unit hypercube, it is well-separable because the distances between every two centers are k. The generation of the points guarantees that it is well-spread. The paper describes how the synthetic data is generated but does not specify any training/validation/test splits. |
| Hardware Specification | Yes | Our experiments are conducted on a machine with Processor 11th Gen Intel(R) Core(TM) i5-1135G7 2.40GHz, 1382 Mhz, 4 Core(s), 8 Logical Processor(s) and 12 GB RAM. |
| Software Dependencies | No | The paper mentions the Scikit-learn library in the context of common parameter settings but does not provide specific version numbers for any software used in their own experimental implementation. |
| Experiment Setup | Yes | For greedy k-means++, we sample ln k + 2 candidates in every iteration. The experiment involves 100 repetitions. We compare the average objective and coverage probability in every iteration. The average objective in an iteration is the average k-means objective using the currently chosen centers in 100 repetitions. Similarly, the average coverage probability is the probability that a new center is covered in 100 repetitions. We show the results over iterations for k = 16 in Figure 1 (refer to Appendix D for different k). |