Analyzing $D^α$ seeding for $k$-means

Authors: Etienne Bamas, Sai Ganesh Nagarajan, Ola Svensson

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we provide an experimental validation of the effects of the aforementioned parameters when using $D^α$. Our main contribution is a theoretical analysis of the advantage of $D^α$ seeding, proving that it leads to constant-factor approximation guarantees for a large class of instances, including a balanced mixture of k Gaussians, where the standard k-means++ algorithm is already no better than Ω(log k) (see Section 3). We run additional experiments in Section B.
Researcher Affiliation | Academia | ¹ETH AI Center, Zurich, Switzerland; ²Zuse Institute Berlin; ³EPFL.
Pseudocode | No | The paper describes algorithms in text, such as 'The k-means++ algorithm' and 'The greedy k-means++ algorithm', but does not provide them in structured pseudocode or an algorithm block. (An illustrative sketch of the seeding rule is given after this table.)
Open Source Code | Yes | $D^α$ Seeding. Link to code: https://github.com/saiganesh223/Alpha-Seeding-for-kmeans, 2024.
Open Datasets | Yes | For instance, on the MNIST dataset, they find that setting α close to 4 is a significantly better choice than α = 2. We run the $D^α$-seeding on the following instances: 1) D1: A mixture of 4 Gaussians... 5) D5: A mixture of 4 bivariate Student-t distributions... (An illustrative sampling sketch follows the table.)
Dataset Splits | No | We generate these instances with n = 1000 samples from the appropriate mixture distribution. Then we consider α = {2, 6, 10, ..., 38}, i.e., starting from 2 in increments of 4, for the Gaussian instances D1 through D4. In all experiments, we show the average performance over N = 5000 trial runs for each value of α. No explicit mention of train/validation/test splits.
Hardware Specification | No | No specific hardware details (such as GPU/CPU models, processor types, or memory amounts) used for running experiments are provided in the paper.
Software Dependencies | No | The paper mentions the 'Scikit-learn' library (Pedregosa et al., 2011, 'Machine Learning in Python'), but does not provide specific version numbers for it or any other software dependencies.
Experiment Setup | Yes | We generate these instances with n = 1000 samples from the appropriate mixture distribution. Then we consider α = {2, 6, 10, ..., 38}, i.e., starting from 2 in increments of 4, for the Gaussian instances D1 through D4. In all experiments, we show the average performance over N = 5000 trial runs for each value of α. (See the illustrative driver sketch after the table.)
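
Since the paper describes the seeding procedure only in text, a minimal sketch of the $D^α$ seeding rule is included here for illustration. The function name `d_alpha_seeding` and the numpy-based implementation are assumptions of this write-up, not the authors' released code (see the linked repository for that). Setting α = 2 recovers the classical k-means++ ($D^2$) seeding.

```python
import numpy as np

def d_alpha_seeding(X, k, alpha=2.0, rng=None):
    """Illustrative sketch of D^alpha seeding (not the authors' code).

    Pick the first center uniformly at random, then repeatedly sample a
    new center with probability proportional to (distance to the nearest
    already-chosen center) raised to the power alpha. alpha = 2 gives the
    classical k-means++ rule.
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]                       # first center: uniform
    # squared distance from every point to its nearest chosen center
    d2 = np.sum((X - centers[0]) ** 2, axis=1)
    for _ in range(1, k):
        weights = d2 ** (alpha / 2.0)                    # distance^alpha
        probs = weights / weights.sum()
        idx = rng.choice(n, p=probs)
        centers.append(X[idx])
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))
    return np.array(centers)
```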
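The synthetic instances D1 through D5 are mixtures such as the 4-Gaussian mixture quoted above. As a hedged illustration of how such an instance can be generated, the sketch below draws n points from a balanced mixture of k spherical Gaussians; the means, variance, and dimension are placeholder assumptions, since the exact parameters of D1 through D5 are not restated here.

```python
import numpy as np

def sample_gaussian_mixture(n=1000, k=4, dim=2, spread=10.0, sigma=1.0, rng=None):
    """Draw n points from a balanced mixture of k spherical Gaussians.

    Component means are placed uniformly at random in a box of side 2*spread;
    all parameters are illustrative placeholders, not the paper's exact
    D1-D4 settings.
    """
    rng = np.random.default_rng(rng)
    means = rng.uniform(-spread, spread, size=(k, dim))
    labels = rng.integers(k, size=n)                     # balanced in expectation
    return means[labels] + sigma * rng.normal(size=(n, dim)), labels
```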
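Finally, a sketch of a driver loop matching the shape of the reported setup (n = 1000 samples, α ranging over {2, 6, ..., 38}, averages over N = 5000 trials). The helpers `d_alpha_seeding` and `sample_gaussian_mixture` are the illustrative sketches above, and `seeding_cost` is a hypothetical helper; none of this is the authors' released code.

```python
import numpy as np

def seeding_cost(X, centers):
    """k-means cost of the seeded centers: sum of squared distances
    from each point to its nearest center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def run_experiment(n=1000, k=4, alphas=range(2, 39, 4), trials=5000, seed=0):
    """Average seeding cost over repeated trials for each alpha,
    on a single sampled instance (an illustrative simplification)."""
    rng = np.random.default_rng(seed)
    X, _ = sample_gaussian_mixture(n=n, k=k, rng=rng)
    results = {}
    for alpha in alphas:
        costs = [seeding_cost(X, d_alpha_seeding(X, k, alpha, rng))
                 for _ in range(trials)]
        results[alpha] = np.mean(costs)
    return results
```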