Fast and Accurate $k$-means++ via Rejection Sampling
Authors: Vincent Cohen-Addad, Silvio Lattanzi, Ashkan Norouzi-Fard, Christian Sohler, Ola Svensson
NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section we empirically validate our theoretical results by comparing our algorithms FASTKMEANS++ and REJECTIONSAMPLING (see the details on how we set the parameters for LSH in the full version) with the following two baselines. K-MEANS++ algorithm: perhaps the most commonly used algorithm in this field; it samples k points according to the D2-distribution. AFKMC2 algorithm: a recent result [4] based on random walks that improves the running time of the K-MEANS++ algorithm while maintaining a (weaker) theoretical guarantee on the solution quality. (Illustrative sketches of D2-sampling and of the rejection-sampling idea follow the table.) |
| Researcher Affiliation | Collaboration | Vincent Cohen-Addad (Google Research, cohenaddad@google.com); Silvio Lattanzi (Google Research, silviol@google.com); Ashkan Norouzi-Fard (Google Research, ashkannorouzi@google.com); Christian Sohler (University of Cologne, csohler@uni-koeln.de); Ola Svensson (EPFL, ola.svensson@epfl.ch) |
| Pseudocode | Yes | Algorithm 1 MULTITREEOPEN, Algorithm 2 MULTITREESAMPLE, Algorithm 3 FASTK-MEANS++, Algorithm 4 REJECTIONSAMPLING |
| Open Source Code | No | The paper does not include an unambiguous statement that the authors are releasing their source code for the described methodology, nor does it provide a direct link to a code repository for their proposed algorithms. |
| Open Datasets | Yes | We ran our algorithms on three classic datasets from the UCI library [15]: KDD-Cup [12] (311,029 points of dimension 74), Song [8] (515,345 points of dimension 90), and Census [19] (2,458,285 points of dimension 68). |
| Dataset Splits | No | The paper mentions using datasets but does not specify any explicit train/validation/test splits, percentages, or methodology for partitioning the data, nor does it reference predefined splits with citations. |
| Hardware Specification | No | The paper states 'The algorithms were run on a standard desktop computer' but provides no specific details regarding the CPU, GPU, memory, or any other hardware components used for the experiments. |
| Software Dependencies | No | The paper does not list specific software dependencies with their version numbers (e.g., Python, PyTorch, or specific solvers with versions) that would be needed to replicate the experiments. |
| Experiment Setup | No | The paper mentions using the AFKMC2 algorithm with a parameter 'm = 200', but it does not provide specific hyperparameter values or detailed training configurations for its own proposed algorithms (FASTK-MEANS++ and REJECTIONSAMPLING) within the main text. |
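
The K-MEANS++ baseline quoted in the table selects k centers by D2-sampling: each new center is drawn with probability proportional to its squared distance to the nearest already-chosen center. The sketch below is a minimal illustrative implementation of that baseline only (not the paper's FASTKMEANS++); the function name and the NumPy-based structure are our own assumptions, written to make the O(nkd) cost of classic seeding explicit.

```python
import numpy as np

def kmeanspp_seeding(X, k, rng=None):
    """Classic k-means++ seeding via D^2-sampling (illustrative sketch).

    X : (n, d) array of points, k : number of centers.
    Each round scans all points, so the total cost is O(n * k * d),
    which is the running time the paper's algorithms aim to improve.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]

    # The first center is chosen uniformly at random.
    centers = [X[rng.integers(n)]]

    # Squared distance of every point to its closest chosen center so far.
    d2 = np.sum((X - centers[0]) ** 2, axis=1)

    for _ in range(k - 1):
        # D^2-sampling: pick a point with probability proportional to its
        # squared distance to the nearest already-chosen center.
        idx = rng.choice(n, p=d2 / d2.sum())
        centers.append(X[idx])

        # Refresh the nearest-center distances with the new center.
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))

    return np.array(centers)
```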
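The REJECTIONSAMPLING pseudocode itself is not reproduced here, so the following is only a generic sketch of the rejection-sampling principle named in the title, under the assumption that cheap upper bounds on the squared distances are available (for example, stale distances from earlier rounds). Proposing a point proportionally to its upper bound and accepting with probability (true squared distance) / (upper bound) yields a sample from the exact D2-distribution while computing exact distances only for proposed points. This is not the paper's Algorithm 4, which additionally relies on the LSH-based data structures whose parameters are set in the full version; all names below are hypothetical.

```python
import numpy as np

def d2_sample_by_rejection(X, centers, upper_bounds, rng=None):
    """Draw one index exactly from the D^2-distribution using rejection
    sampling against cheap upper bounds on the squared distances.

    X            : (n, d) array of points
    centers      : (m, d) array of already-chosen centers
    upper_bounds : (n,) array with upper_bounds[i] >= true squared distance
                   of X[i] to its closest center (e.g., a stale distance)
    """
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    proposal = upper_bounds / upper_bounds.sum()

    while True:
        # Propose a point proportionally to its (cheap) upper bound.
        i = rng.choice(n, p=proposal)

        # Only now pay for the exact squared distance of the proposed point.
        true_d2 = np.min(np.sum((centers - X[i]) ** 2, axis=1))

        # Accept with probability true / upper; conditioned on acceptance,
        # the index is distributed exactly according to D^2.
        if rng.random() < true_d2 / upper_bounds[i]:
            return i
```

The expected number of trials is (sum of upper bounds) / (sum of true squared distances), so the tighter the upper bounds, the fewer exact distance computations are needed per sampled center.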