Fast and Accurate $k$-means++ via Rejection Sampling

Authors: Vincent Cohen-Addad, Silvio Lattanzi, Ashkan Norouzi-Fard, Christian Sohler, Ola Svensson

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section we empirically validate our theoretical results by comparing our algorithms FASTKMEANS++ and REJECTIONSAMPLING (see the details on how we set the parameters for LSH in the full version) with the following two baselines. K-MEANS++ algorithm: perhaps the most commonly used algorithm in this field; it samples k points according to the D²-distribution. AFKMC2 algorithm: a recent result [4] based on random walks that improves the running time of the K-MEANS++ algorithm while maintaining a (weaker) theoretical guarantee on the solution quality.
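The D²-sampling step that defines the K-MEANS++ baseline can be sketched as follows. This is a generic illustration of standard k-means++ seeding, not the paper's accelerated FASTKMEANS++ or REJECTIONSAMPLING algorithms; the function and variable names are ours.

```python
import random

def sq_dist(a, b):
    # squared Euclidean distance between two equal-length tuples
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeanspp_seed(points, k, seed=0):
    """k-means++ seeding: the first center is uniform at random, and every
    later center is drawn with probability proportional to its squared
    distance (D^2) to the nearest center chosen so far."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    # d2[i] = squared distance from points[i] to its nearest chosen center
    d2 = [sq_dist(p, centers[0]) for p in points]
    for _ in range(k - 1):
        r = rng.uniform(0.0, sum(d2))
        acc, idx = 0.0, len(points) - 1
        for i, w in enumerate(d2):
            acc += w
            if acc >= r:
                idx = i
                break
        centers.append(points[idx])
        # update nearest-center distances after adding the new center
        d2 = [min(old, sq_dist(p, points[idx])) for old, p in zip(d2, points)]
    return centers
```

Each of the k-1 sampling rounds scans all n points, which is exactly the O(nk) cost the paper's algorithms aim to beat.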
Researcher Affiliation | Collaboration | Vincent Cohen-Addad (Google Research, cohenaddad@google.com), Silvio Lattanzi (Google Research, silviol@google.com), Ashkan Norouzi-Fard (Google Research, ashkannorouzi@google.com), Christian Sohler (University of Cologne, csohler@uni-koeln.de), Ola Svensson (EPFL, ola.svensson@epfl.ch)
Pseudocode | Yes | Algorithm 1 MULTITREEOPEN, Algorithm 2 MULTITREESAMPLE, Algorithm 3 FASTK-MEANS++, Algorithm 4 REJECTIONSAMPLING
Open Source Code | No | The paper does not include an unambiguous statement that the authors are releasing their source code for the described methodology, nor does it provide a direct link to a code repository for their proposed algorithms.
Open Datasets | Yes | We ran our algorithms on three classic datasets from the UCI library [15]: KDD-Cup [12] (311,029 points of dimension 74), Song [8] (515,345 points of dimension 90), and Census [19] (2,458,285 points of dimension 68).
Dataset Splits | No | The paper mentions using datasets but does not specify any explicit train/validation/test splits, percentages, or methodology for partitioning the data, nor does it reference predefined splits with citations.
Hardware Specification | No | The paper states 'The algorithms were run on a standard desktop computer' but provides no specific details regarding the CPU, GPU, memory, or any other hardware components used for the experiments.
Software Dependencies | No | The paper does not list specific software dependencies with their version numbers (e.g., Python, PyTorch, or specific solvers with versions) that would be needed to replicate the experiments.
Experiment Setup | No | The paper mentions using the AFKMC2 algorithm with a parameter 'm = 200', but it does not provide specific hyperparameter values or detailed training configurations for its own proposed algorithms (FASTK-MEANS++ and REJECTIONSAMPLING) within the main text.