Analyzing $D^α$ seeding for $k$-means

Authors: Etienne Bamas, Sai Ganesh Nagarajan, Ola Svensson

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we provide an experimental validation of the effects of the aforementioned parameters when using $D^α$. Our main contribution is a theoretical analysis of the advantage of $D^α$ seeding, proving that it leads to constant-factor approximation guarantees for a large class of instances, including a balanced mixture of k Gaussians, where the standard k-means++ algorithm is already no better than Ω(log k) (see Section 3). We run additional experiments in Section B.
Researcher Affiliation | Academia | ¹ETH AI Center, Zurich, Switzerland; ²Zuse Institute Berlin; ³EPFL.
Pseudocode | No | The paper describes algorithms in text, such as 'The k-means++ algorithm' and 'The greedy k-means++ algorithm', but does not provide them in structured pseudocode or an algorithm block. (An illustrative sketch of the seeding rule is given after this table.)
Open Source Code | Yes | $D^α$ Seeding. Link to code: https://github.com/saiganesh223/Alpha-Seeding-for-kmeans, 2024.
Open Datasets | Yes | For instance, on the MNIST dataset, they find that setting α close to 4 is a significantly better choice than α = 2. We run the $D^α$-seeding on the following instances: 1) D1: A mixture of 4 Gaussians... 5) D5: A mixture of 4 bivariate Student-t distributions... (An illustrative sampling sketch follows the table.)
Dataset Splits | No | We generate these instances with n = 1000 samples from the appropriate mixture distribution. Then we consider α = {2, 6, 10, ..., 38}, i.e., starting from 2 in increments of 4, for the Gaussian instances D1 through D4. In all experiments, we show the average performance over N = 5000 trial runs for each value of α. No explicit mention of train/validation/test splits.
Hardware Specification | No | No specific hardware details (such as GPU/CPU models, processor types, or memory amounts) used for running experiments are provided in the paper.
Software Dependencies | No | The paper mentions the 'Scikit-learn' library (Pedregosa et al., 2011, 'Machine Learning in Python'), but does not provide specific version numbers for it or any other software dependencies.
Experiment Setup | Yes | We generate these instances with n = 1000 samples from the appropriate mixture distribution. Then we consider α = {2, 6, 10, ..., 38}, i.e., starting from 2 in increments of 4, for the Gaussian instances D1 through D4. In all experiments, we show the average performance over N = 5000 trial runs for each value of α. (See the illustrative driver sketch after the table.)
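
Since the paper describes the seeding procedure only in text, a minimal sketch of the $D^α$ seeding rule is included here for illustration. The function name `d_alpha_seeding` and the numpy-based implementation are assumptions of this write-up, not the authors' released code (see the linked repository for that). Setting α = 2 recovers the classical k-means++ ($D^2$) seeding.

```python
import numpy as np

def d_alpha_seeding(X, k, alpha=2.0, rng=None):
    """Illustrative sketch of D^alpha seeding (not the authors' code).

    Pick the first center uniformly at random, then repeatedly sample a
    new center with probability proportional to (distance to the nearest
    already-chosen center) raised to the power alpha. alpha = 2 gives the
    classical k-means++ rule.
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]                       # first center: uniform
    # squared distance from every point to its nearest chosen center
    d2 = np.sum((X - centers[0]) ** 2, axis=1)
    for _ in range(1, k):
        weights = d2 ** (alpha / 2.0)                    # distance^alpha
        probs = weights / weights.sum()
        idx = rng.choice(n, p=probs)
        centers.append(X[idx])
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))
    return np.array(centers)
```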
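The synthetic instances D1 through D5 are mixtures such as the 4-Gaussian mixture quoted above. As a hedged illustration of how such an instance can be generated, the sketch below draws n points from a balanced mixture of k spherical Gaussians; the means, variance, and dimension are placeholder assumptions, since the exact parameters of D1 through D5 are not restated here.

```python
import numpy as np

def sample_gaussian_mixture(n=1000, k=4, dim=2, spread=10.0, sigma=1.0, rng=None):
    """Draw n points from a balanced mixture of k spherical Gaussians.

    Component means are placed uniformly at random in a box of side 2*spread;
    all parameters are illustrative placeholders, not the paper's exact
    D1-D4 settings.
    """
    rng = np.random.default_rng(rng)
    means = rng.uniform(-spread, spread, size=(k, dim))
    labels = rng.integers(k, size=n)                     # balanced in expectation
    return means[labels] + sigma * rng.normal(size=(n, dim)), labels
```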
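Finally, a sketch of a driver loop matching the shape of the reported setup (n = 1000 samples, α ranging over {2, 6, ..., 38}, averages over N = 5000 trials). The helpers `d_alpha_seeding` and `sample_gaussian_mixture` are the illustrative sketches above, and `seeding_cost` is a hypothetical helper; none of this is the authors' released code.

```python
import numpy as np

def seeding_cost(X, centers):
    """k-means cost of the seeded centers: sum of squared distances
    from each point to its nearest center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def run_experiment(n=1000, k=4, alphas=range(2, 39, 4), trials=5000, seed=0):
    """Average seeding cost over repeated trials for each alpha,
    on a single sampled instance (an illustrative simplification)."""
    rng = np.random.default_rng(seed)
    X, _ = sample_gaussian_mixture(n=n, k=k, rng=rng)
    results = {}
    for alpha in alphas:
        costs = [seeding_cost(X, d_alpha_seeding(X, k, alpha, rng))
                 for _ in range(trials)]
        results[alpha] = np.mean(costs)
    return results
```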