Analyzing $D^α$ seeding for $k$-means
Authors: Etienne Bamas, Sai Ganesh Nagarajan, Ola Svensson
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we provide an experimental validation of the effects of the aforementioned parameters when using D^α. Our main contribution is a theoretical analysis of the advantage of D^α seeding, proving that it leads to constant-factor approximation guarantees for a large class of instances, including a balanced mixture of k Gaussians, where the standard k-means++ algorithm is already no better than Ω(log k) (see Section 3). We run additional experiments in Section B. |
| Researcher Affiliation | Academia | ¹ETH AI Center, Zurich, Switzerland. ²Zuse Institute Berlin. ³EPFL. |
| Pseudocode | No | The paper describes algorithms in text, such as 'The k-means++ algorithm' and 'The greedy k-means++ algorithm', but does not provide them in structured pseudocode or an algorithm block. (A hedged Python sketch of D^α seeding is given after this table.) |
| Open Source Code | Yes | D^α Seeding. Link to code. https://github.com/saiganesh223/Alpha-Seeding-for-kmeans, 2024. |
| Open Datasets | Yes | For instance, on the MNIST dataset, they find that setting α close to 4 is a significantly better choice than α = 2. We run the D^α-seeding on the following instances: 1) D1: A mixture of 4 Gaussians... 5) D5: A mixture of 4 bivariate Student-t distributions... |
| Dataset Splits | No | We generate these instances with n = 1000 samples from the appropriate mixture distribution. Then we consider α = {2, 6, 10, . . . , 38}, i.e., starting from 2 with increments of 4, for the Gaussian instances (i.e., D1 through D4). In all experiments, we show the average performance over N = 5000 trial runs for each value of α. No explicit mention of train/validation/test splits. |
| Hardware Specification | No | No specific hardware details (such as GPU/CPU models, processor types, or memory amounts) used for running experiments are provided in the paper. |
| Software Dependencies | No | The paper mentions the 'Scikit-learn' library ('Machine Learning in Python', Pedregosa et al., 2011), but does not provide specific version numbers for it or for any other software dependency. |
| Experiment Setup | Yes | We generate these instances with n = 1000 samples from the appropriate mixture distribution. Then we consider α = {2, 6, 10, . . . , 38}, i.e., starting from 2 with increments of 4, for the Gaussian instances (i.e., D1 through D4). In all experiments, we show the average performance over N = 5000 trial runs for each value of α. (A hedged sketch of this α sweep appears below, after the seeding sketch.) |
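
Since the paper provides no pseudocode, the following is a minimal Python sketch of D^α seeding as it is described in the text: the first center is chosen uniformly at random, and each subsequent center is sampled with probability proportional to the α-th power of its distance to the nearest chosen center (α = 2 recovers standard D² / k-means++ seeding). The function name `d_alpha_seeding` and the NumPy-based implementation are our own illustration, not the authors' code.

```python
import numpy as np

def d_alpha_seeding(X, k, alpha, rng=None):
    """Pick k initial centers from X, sampling each new center with
    probability proportional to dist(x, chosen centers)^alpha.
    alpha = 2 recovers standard k-means++ (D^2) seeding."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    # First center: uniform at random.
    centers = [X[rng.integers(n)]]
    # Squared distance of every point to its closest chosen center.
    closest_sq = np.sum((X - centers[0]) ** 2, axis=1)
    for _ in range(1, k):
        # D^alpha weights: dist^alpha = (squared dist)^(alpha / 2).
        weights = closest_sq ** (alpha / 2.0)
        probs = weights / weights.sum()
        idx = rng.choice(n, p=probs)
        centers.append(X[idx])
        new_sq = np.sum((X - X[idx]) ** 2, axis=1)
        closest_sq = np.minimum(closest_sq, new_sq)
    return np.array(centers)
```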
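
The experiment setup quoted above can be approximated with the sketch below. It reuses `d_alpha_seeding` from the previous block, draws n = 1000 points from a balanced mixture of 4 Gaussians (the means and variances here are hypothetical, since the paper's exact D1 parameters are not quoted in this report), sweeps α ∈ {2, 6, 10, ..., 38}, and averages the k-means cost over repeated trials; 50 trials are used here for speed instead of the paper's N = 5000, and refining the seeds with scikit-learn's Lloyd's iterations is our assumption about the evaluation pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans  # Pedregosa et al., 2011

def mixture_of_gaussians(n, rng):
    # Hypothetical balanced mixture of 4 spherical Gaussians in 2D;
    # the paper's exact D1 parameters may differ.
    means = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
    labels = rng.integers(len(means), size=n)
    return means[labels] + rng.normal(scale=1.0, size=(n, 2))

rng = np.random.default_rng(0)
X = mixture_of_gaussians(n=1000, rng=rng)
k = 4
alphas = range(2, 39, 4)   # alpha = 2, 6, 10, ..., 38
trials = 50                # the paper averages over N = 5000 runs

for alpha in alphas:
    costs = []
    for _ in range(trials):
        centers = d_alpha_seeding(X, k, alpha, rng=rng)
        # Lloyd's algorithm initialized with the D^alpha seeds.
        km = KMeans(n_clusters=k, init=centers, n_init=1).fit(X)
        costs.append(km.inertia_)
    print(f"alpha={alpha:2d}  mean k-means cost = {np.mean(costs):.2f}")
```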