Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Massively Parallel $k$-Means Clustering for Perturbation Resilient Instances
Authors: Vincent Cohen-Addad, Vahab Mirrokni, Peilin Zhong
ICML 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we present an empirical study of algorithms validating their effectiveness. ... 5. Experiments. In addition to the formal analysis of the theoretical guarantees of our k-means algorithms, we also conduct preliminary experiments on both synthetic datasets and real datasets in the offline setting to further investigate their practical performances. |
| Researcher Affiliation | Industry | Vincent Cohen-Addad * 1 Vahab Mirrokni * 1 Peilin Zhong * 1 1Google Research. Correspondence to: Vincent Cohen-Addad <EMAIL>, Vahab Mirrokni <EMAIL>, Peilin Zhong <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Near Neighbor Graph via LSH, Algorithm 2 Candidates of Optimal Clusters, Algorithm 3 Tree Construction over Candidate Clusters, Algorithm 4 Conversion to A Binary Tree, Algorithm 5 Dynamic Programming for Exact Solution, Algorithm 6 Dynamic Programming for Approximate Solution, Algorithm 7 Framework for Dynamic Programming in MPC, Algorithm 8 LSH for Points in Euclidean Spaces. |
| Open Source Code | No | The paper does not contain any explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | All datasets can be obtained by `sklearn.datasets.fetch_openml` of the Python scikit-learn package. |
| Dataset Splits | No | The paper describes experiments on synthetic and real datasets but does not explicitly provide specific training/validation/test dataset splits. |
| Hardware Specification | Yes | We ran experiments on a machine with 16G RAM and Intel Core i7-3720QM@2.60GHz CPU. All experiments were in single threaded mode. |
| Software Dependencies | No | All codes are in C++. ... k-means solver using k-means++ seeding implemented by Python scikit-learn package (Pedregosa et al., 2011). No specific version numbers for the C++ compiler or scikit-learn are provided. |
| Experiment Setup | Yes | We use the same choice as (Datar et al., 2004), i.e., r = 4 and m = 10, and thus we choose t of Algorithm 1 to be 100. For Algorithm 2, instead of running Algorithm 1 for each scale $1.01^i$, we run for each scale $1.1^i$. For Algorithm 6, we choose $\varepsilon = 0.1$. |
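
The LSH parameters quoted in the Experiment Setup row can be illustrated with a minimal sketch of the p-stable (Gaussian) hashing scheme of Datar et al. (2004). This is not the authors' C++ code, which is not publicly available. It assumes that r is the bucket width, m is the number of hash functions concatenated into one bucket key, and t is the number of independent repetitions. The function names `make_pstable_hash` and `candidate_pairs` are hypothetical.

```python
import math
import random


def make_pstable_hash(dim, r, m, rng):
    """Build one composite hash: a tuple of m values floor((a . x + b) / r).

    Assumption: r is the bucket width and m the number of concatenated
    component hashes, following the usual reading of Datar et al. (2004).
    """
    # Each component uses a Gaussian direction a and a uniform offset b in [0, r].
    params = [
        ([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, r))
        for _ in range(m)
    ]

    def composite_hash(point):
        return tuple(
            math.floor((sum(a_i * x_i for a_i, x_i in zip(a, point)) + b) / r)
            for a, b in params
        )

    return composite_hash


def candidate_pairs(points, r=4.0, m=10, t=100, seed=0):
    """Return index pairs (i, j), i < j, that share a bucket in at least one
    of t independent repetitions (t is assumed to be the repetition count)."""
    rng = random.Random(seed)
    dim = len(points[0])
    pairs = set()
    for _ in range(t):
        composite_hash = make_pstable_hash(dim, r, m, rng)
        buckets = {}
        for idx, point in enumerate(points):
            buckets.setdefault(composite_hash(point), []).append(idx)
        # Every pair of points in the same bucket becomes a near-neighbor candidate.
        for members in buckets.values():
            for i in range(len(members)):
                for j in range(i + 1, len(members)):
                    pairs.add((members[i], members[j]))
    return pairs


if __name__ == "__main__":
    demo_points = [(0.0, 0.0), (0.0, 0.0), (1000.0, 1000.0)]
    print(sorted(candidate_pairs(demo_points)))
```

Identical points always land in the same bucket, while points that are far apart relative to r collide only with very small probability, so the candidate set stays sparse for well-separated clusters. How the paper's Algorithm 1 turns such candidates into a near neighbor graph is not described in this summary and is not reproduced here.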