Massively Parallel $k$-Means Clustering for Perturbation Resilient Instances

Authors: Vincent Cohen-Addad, Vahab Mirrokni, Peilin Zhong

ICML 2022

Reproducibility checklist (Variable / Result / LLM Response):
Research Type: Experimental. "Finally, we present an empirical study of algorithms validating their effectiveness." (Section 5, Experiments:) "In addition to the formal analysis of the theoretical guarantees of our k-means algorithms, we also conduct preliminary experiments on both synthetic datasets and real datasets in the offline setting to further investigate their practical performances."
Researcher Affiliation: Industry. "Vincent Cohen-Addad*1, Vahab Mirrokni*1, Peilin Zhong*1; 1Google Research. Correspondence to: Vincent Cohen-Addad <cohenaddad@google.com>, Vahab Mirrokni <mirrokni@google.com>, Peilin Zhong <peilinz@google.com>."
Pseudocode: Yes. Algorithm 1 (Near Neighbor Graph via LSH), Algorithm 2 (Candidates of Optimal Clusters), Algorithm 3 (Tree Construction over Candidate Clusters), Algorithm 4 (Conversion to A Binary Tree), Algorithm 5 (Dynamic Programming for Exact Solution), Algorithm 6 (Dynamic Programming for Approximate Solution), Algorithm 7 (Framework for Dynamic Programming in MPC), Algorithm 8 (LSH for Points in Euclidean Spaces).
Open Source Code: No. The paper does not contain any explicit statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets: Yes. "All datasets can be obtained by sklearn.datasets.fetch_openml of the Python scikit-learn package."
Dataset Splits: No. The paper describes experiments on synthetic and real datasets but does not explicitly provide specific training/validation/test dataset splits.
Hardware Specification: Yes. "We ran experiments on a machine with 16G RAM and Intel Core i7-3720QM@2.60GHz CPU. All experiments were in single threaded mode."
Software Dependencies: No. "All codes are in C++. ... k-means solver using k-means++ seeding implemented by Python scikit-learn package (Pedregosa et al., 2011)." No specific version numbers for the C++ compiler or scikit-learn are provided.
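The scikit-learn baseline mentioned above relies on k-means++ seeding (Arthur & Vassilvitskii, 2007). For illustration only, here is a minimal numpy re-implementation of that seeding rule; the function name and test data are ours, not the paper's or scikit-learn's code:

```python
import numpy as np

def kmeans_pp_seeds(X, k, rng):
    """k-means++ seeding: pick the first center uniformly at random,
    then each subsequent center with probability proportional to the
    squared distance to the nearest already-chosen center."""
    n = X.shape[0]
    centers = [X[rng.integers(n)]]
    for _ in range(k - 1):
        # squared distance from every point to its nearest chosen center
        d2 = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1), axis=1)
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centers)

rng = np.random.default_rng(0)
# three well-separated synthetic blobs around (0,0), (5,5), (10,10)
X = np.concatenate([rng.normal(c, 0.1, size=(50, 2)) for c in (0.0, 5.0, 10.0)])
seeds = kmeans_pp_seeds(X, 3, rng)
print(seeds)  # with high probability, one seed per blob
```

Because seeding weights favor points far from existing centers, well-separated clusters each tend to receive exactly one seed, which is why k-means++ is the standard baseline initialization.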
Experiment Setup: Yes. "We use the same choice as (Datar et al., 2004), i.e., r = 4 and m = 10, and thus we choose t of Algorithm 1 to be 100. For Algorithm 2, instead of running Algorithm 1 for each scale 1.01^i, we run for each scale 1.1^i. For Algorithm 6, we choose ε = 0.1."
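The parameters above are those of the p-stable Euclidean LSH of Datar et al. (2004): each of t hash tables concatenates m hashes of the form h(x) = floor((a·x + b)/r). The following numpy sketch illustrates that scheme with the paper's r = 4 and m = 10 (names and the toy data are ours, assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_table(dim, r=4.0, m=10, rng=rng):
    """One hash table: m Gaussian projections a ~ N(0, I) and offsets b ~ U[0, r)."""
    A = rng.normal(size=(m, dim))
    b = rng.uniform(0.0, r, size=m)
    return A, b

def bucket(x, table, r=4.0):
    """Concatenated key: h_i(x) = floor((a_i . x + b_i) / r) for i = 1..m."""
    A, b = table
    return tuple(np.floor((A @ x + b) / r).astype(int))

dim, t = 8, 100          # t = 100 tables, as chosen for Algorithm 1
tables = [make_table(dim) for _ in range(t)]

x = rng.normal(size=dim)
near = x + 0.01 * rng.normal(size=dim)   # a close neighbor of x
far = x + 100.0 * rng.normal(size=dim)   # a distant point

near_hits = sum(bucket(x, T) == bucket(near, T) for T in tables)
far_hits = sum(bucket(x, T) == bucket(far, T) for T in tables)
print(near_hits, far_hits)  # nearby points collide in far more tables
```

Running Algorithm 1 over geometric scales 1.1^i rather than 1.01^i, as the quoted setup does, coarsens the distance scales searched and so trades a slightly weaker approximation for far fewer LSH invocations.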