Fast Distributed k-Center Clustering with Outliers on Massive Data

Authors: Gustavo Malkomes, Matt J. Kusner, Wenlin Chen, Kilian Q. Weinberger, Benjamin Moseley

NeurIPS 2015

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we perform experiments with both algorithms on real-world datasets. For k-center, we observe that the quality of the solutions is effectively the same as that of the sequential algorithm for all values of k, the best one could hope for. For the k-center problem with outliers, our algorithm matches the sequential algorithm as the values of k and z vary, and it significantly outperforms the algorithm which does not explicitly consider outliers. Somewhat surprisingly, our algorithm achieves an order-of-magnitude speed-up over the sequential algorithm even if it is run sequentially. (A sketch of the standard sequential outlier baseline appears after the table.)
Researcher Affiliation | Academia | Gustavo Malkomes, Matt J. Kusner, Wenlin Chen (Department of Computer Science and Engineering, Washington University in St. Louis, St. Louis, MO 63130; {luizgustavo, mkusner, wenlinchen}@wustl.edu); Kilian Q. Weinberger (Department of Computer Science, Cornell University, Ithaca, NY 14850; kqw4@cornell.edu); Benjamin Moseley (Department of Computer Science and Engineering, Washington University in St. Louis, St. Louis, MO 63130; bmoseley@wustl.edu)
Pseudocode | Yes | Algorithm 1 Sequential k-center GREEDY(U, k)
  1: X = ∅
  2: Add any point u ∈ U to X
  3: while |X| < k do
  4:    u = argmax_{v ∈ U} d_X(v)
  5:    X = X ∪ {u}
  6: end while
  (A runnable sketch of this routine is given after the table.)
Open Source Code | No | The text does not explicitly state that the source code for the methodology described in this paper is publicly available, nor does it provide a link to a code repository.
Open Datasets | Yes | Table 1: The clustering datasets (and their descriptions) used for evaluation.
  name             description                                                      n           dim.
  Parkinsons [28]  patients with early-stage Parkinson's disease                    5,875       22
  Census¹          census household information                                     45,222      12
  Skin¹            RGB-pixel samples from face images                               245,057     3
  Yahoo [11]       web-search ranking dataset (features are GBRT outputs [29])      473,134     500
  Covertype¹       a forest cover dataset with cartographic features                522,911     13
  Power¹           household electric power readings                                2,049,280   7
  Higgs¹           particle detector measurements (the seven high-level features)   11,000,000  7
Dataset Splits | No | The paper mentions varying parameters and using subsamples of datasets for evaluation, but it does not specify explicit training/validation/test splits or a cross-validation methodology.
Hardware Specification | Yes | All methods were implemented in MATLAB™ and conducted on an 8-core Intel Xeon 2 GHz machine.
Software Dependencies | No | All methods were implemented in MATLAB™. The paper mentions the software used (MATLAB) but does not provide specific version numbers for it or for any other software dependencies, which are required for reproducibility.
Experiment Setup | Yes | We vary all parameters: the number of centers k, the number of outliers z, and the number of machines m. We consider datasets for which computing the sequential methods is practical: Parkinsons, Census, and two random subsamples (10,000 inputs each) of Covertype and Power. We show the results in Figure 2. Each column contains the results for a single dataset and each row for a single varying parameter (k, z, or m), along with standard errors over 5 runs. When a parameter is not varied, we fix k = 50, z = 256, and m = 10. (A sketch of a matching parameter sweep appears after the table.)
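
To make the Pseudocode row concrete, the following is a minimal runnable Python/NumPy translation of Algorithm 1 (the classical farthest-point greedy, which carries a 2-approximation guarantee). The function name, randomized seeding, and vectorization are illustrative choices of this write-up, not details from the paper, whose implementation was in MATLAB.

    import numpy as np

    def greedy_k_center(U, k, seed=None):
        """Algorithm 1: greedy 2-approximation for k-center.

        U: (n, d) array of points; k: number of centers.
        Returns the indices of the k chosen centers."""
        rng = np.random.default_rng(seed)
        centers = [int(rng.integers(len(U)))]    # 2: add any point u ∈ U to X
        # dist[v] maintains d_X(v), the distance from v to its nearest center
        dist = np.linalg.norm(U - U[centers[0]], axis=1)
        while len(centers) < k:                  # 3: while |X| < k
            u = int(np.argmax(dist))             # 4: u = argmax_{v ∈ U} d_X(v)
            centers.append(u)                    # 5: X = X ∪ {u}
            dist = np.minimum(dist, np.linalg.norm(U - U[u], axis=1))
        return centers

Updating dist incrementally after each new center costs O(nk) distance evaluations in total, rather than recomputing d_X(v) from scratch at every iteration.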
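
The Research Type row also refers to the k-center problem with outliers, in which up to z points may be left unclustered. For orientation, here is a sketch of the classical sequential greedy of Charikar et al. for a single guessed radius r, the standard baseline this line of work builds on. It is not the paper's distributed algorithm, and the explicit pairwise-distance matrix restricts it to small n.

    import numpy as np

    def greedy_with_outliers(U, k, z, r):
        """Charikar-style greedy for k-center with z outliers, one radius r.

        Repeatedly pick the point whose radius-r ball covers the most still-
        uncovered points, then mark everything within 3r of it as covered.
        Returns (centers, feasible), where feasible means at most z points
        remain uncovered; in practice r is found by searching candidate radii."""
        D = np.linalg.norm(U[:, None, :] - U[None, :, :], axis=2)  # O(n^2) memory
        uncovered = np.ones(len(U), dtype=bool)
        centers = []
        for _ in range(k):
            gain = ((D <= r) & uncovered).sum(axis=1)  # uncovered points per ball
            c = int(np.argmax(gain))                   # densest radius-r ball
            centers.append(c)
            uncovered &= D[c] > 3 * r                  # cover the expanded 3r ball
        return centers, int(uncovered.sum()) <= z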
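
Finally, the Experiment Setup row fixes k = 50, z = 256, and m = 10 while varying one parameter at a time. The sketch below shows how such a sweep might be scripted around the standard two-round distributed pattern (partition the data across m machines, run the sequential greedy locally, then re-run it on the union of the m*k local centers). It reuses greedy_k_center from the first sketch; the helper names and the synthetic stand-in data are ours, and the code illustrates the generic composable pattern rather than a verbatim reconstruction of the paper's implementation.

    import numpy as np

    def distributed_k_center(U, k, m, seed=0):
        """Two-round distributed k-center sketch: each of m simulated machines
        runs the sequential greedy on its random partition, then the greedy
        is re-run on the union of the m*k local centers."""
        rng = np.random.default_rng(seed)
        parts = np.array_split(rng.permutation(len(U)), m)
        local = np.vstack([U[p][greedy_k_center(U[p], k)] for p in parts])
        return local[greedy_k_center(local, k)]

    def kcenter_cost(U, centers):
        """k-center objective: max distance from any point to its nearest center."""
        d = np.linalg.norm(U[:, None, :] - centers[None, :, :], axis=2)
        return float(d.min(axis=1).max())

    # Sweep mirroring the setup: fix k = 50 and vary m (z enters only in the
    # outlier variant, which is omitted here). Synthetic stand-in data.
    X = np.random.default_rng(0).standard_normal((10_000, 7))
    for m in (2, 5, 10, 20):
        print(m, round(kcenter_cost(X, distributed_k_center(X, k=50, m=m)), 3))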