Fast Distributed k-Center Clustering with Outliers on Massive Data

Authors: Gustavo Malkomes, Matt J. Kusner, Wenlin Chen, Kilian Q. Weinberger, Benjamin Moseley

NeurIPS 2015

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we perform experiments with both algorithms on real-world datasets. For k-center, we observe that the quality of the solutions is effectively the same as that of the sequential algorithm for all values of k, the best one could hope for. For the k-center problem with outliers, our algorithm matches the sequential algorithm as the values of k and z vary, and it significantly outperforms the algorithm which does not explicitly consider outliers. Somewhat surprisingly, our algorithm achieves an order-of-magnitude speed-up over the sequential algorithm even if it is run sequentially. (A sketch of the standard sequential outlier baseline appears after the table.)
Researcher Affiliation | Academia | Gustavo Malkomes, Matt J. Kusner, Wenlin Chen (Department of Computer Science and Engineering, Washington University in St. Louis, St. Louis, MO 63130; {luizgustavo, mkusner, wenlinchen}@wustl.edu); Kilian Q. Weinberger (Department of Computer Science, Cornell University, Ithaca, NY 14850; kqw4@cornell.edu); Benjamin Moseley (Department of Computer Science and Engineering, Washington University in St. Louis, St. Louis, MO 63130; bmoseley@wustl.edu)
Pseudocode | Yes | Algorithm 1 Sequential k-center GREEDY(U, k)
  1: X = ∅
  2: Add any point u ∈ U to X
  3: while |X| < k do
  4:    u = argmax_{v ∈ U} d_X(v)
  5:    X = X ∪ {u}
  6: end while
  (A runnable sketch of this routine is given after the table.)
Open Source Code | No | The text does not explicitly state that the source code for the methodology described in this paper is publicly available, nor does it provide a link to a code repository.
Open Datasets | Yes | Table 1: The clustering datasets (and their descriptions) used for evaluation.
  name             description                                                      n           dim.
  Parkinsons [28]  patients with early-stage Parkinson's disease                    5,875       22
  Census¹          census household information                                     45,222      12
  Skin¹            RGB-pixel samples from face images                               245,057     3
  Yahoo [11]       web-search ranking dataset (features are GBRT outputs [29])      473,134     500
  Covertype¹       a forest cover dataset with cartographic features                522,911     13
  Power¹           household electric power readings                                2,049,280   7
  Higgs¹           particle detector measurements (the seven high-level features)   11,000,000  7
Dataset Splits | No | The paper mentions varying parameters and using subsamples of datasets for evaluation, but it does not specify explicit training/validation/test splits or a cross-validation methodology.
Hardware Specification | Yes | All methods were implemented in MATLAB™ and conducted on an 8-core Intel Xeon 2 GHz machine.
Software Dependencies | No | All methods were implemented in MATLAB™. The paper mentions the software used (MATLAB) but does not provide specific version numbers for it or for any other software dependencies, which are required for reproducibility.
Experiment Setup | Yes | We vary all parameters: the number of centers k, the number of outliers z, and the number of machines m. We consider datasets for which computing the sequential methods is practical: Parkinsons, Census, and two random subsamples (10,000 inputs each) of Covertype and Power. We show the results in Figure 2. Each column contains the results for a single dataset and each row for a single varying parameter (k, z, or m), along with standard errors over 5 runs. When a parameter is not varied, we fix k = 50, z = 256, and m = 10. (A sketch of a matching parameter sweep appears after the table.)
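
To make the Pseudocode row concrete, the following is a minimal runnable Python/NumPy translation of Algorithm 1 (the classical farthest-point greedy, which carries a 2-approximation guarantee). The function name, randomized seeding, and vectorization are illustrative choices of this write-up, not details from the paper, whose implementation was in MATLAB.

    import numpy as np

    def greedy_k_center(U, k, seed=None):
        """Algorithm 1: greedy 2-approximation for k-center.

        U: (n, d) array of points; k: number of centers.
        Returns the indices of the k chosen centers."""
        rng = np.random.default_rng(seed)
        centers = [int(rng.integers(len(U)))]    # 2: add any point u ∈ U to X
        # dist[v] maintains d_X(v), the distance from v to its nearest center
        dist = np.linalg.norm(U - U[centers[0]], axis=1)
        while len(centers) < k:                  # 3: while |X| < k
            u = int(np.argmax(dist))             # 4: u = argmax_{v ∈ U} d_X(v)
            centers.append(u)                    # 5: X = X ∪ {u}
            dist = np.minimum(dist, np.linalg.norm(U - U[u], axis=1))
        return centers

Updating dist incrementally after each new center costs O(nk) distance evaluations in total, rather than recomputing d_X(v) from scratch at every iteration.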
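
The Research Type row also refers to the k-center problem with outliers, in which up to z points may be left unclustered. For orientation, here is a sketch of the classical sequential greedy of Charikar et al. for a single guessed radius r, the standard baseline this line of work builds on. It is not the paper's distributed algorithm, and the explicit pairwise-distance matrix restricts it to small n.

    import numpy as np

    def greedy_with_outliers(U, k, z, r):
        """Charikar-style greedy for k-center with z outliers, one radius r.

        Repeatedly pick the point whose radius-r ball covers the most still-
        uncovered points, then mark everything within 3r of it as covered.
        Returns (centers, feasible), where feasible means at most z points
        remain uncovered; in practice r is found by searching candidate radii."""
        D = np.linalg.norm(U[:, None, :] - U[None, :, :], axis=2)  # O(n^2) memory
        uncovered = np.ones(len(U), dtype=bool)
        centers = []
        for _ in range(k):
            gain = ((D <= r) & uncovered).sum(axis=1)  # uncovered points per ball
            c = int(np.argmax(gain))                   # densest radius-r ball
            centers.append(c)
            uncovered &= D[c] > 3 * r                  # cover the expanded 3r ball
        return centers, int(uncovered.sum()) <= z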
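
Finally, the Experiment Setup row fixes k = 50, z = 256, and m = 10 while varying one parameter at a time. The sketch below shows how such a sweep might be scripted around the standard two-round distributed pattern (partition the data across m machines, run the sequential greedy locally, then re-run it on the union of the m*k local centers). It reuses greedy_k_center from the first sketch; the helper names and the synthetic stand-in data are ours, and the code illustrates the generic composable pattern rather than a verbatim reconstruction of the paper's implementation.

    import numpy as np

    def distributed_k_center(U, k, m, seed=0):
        """Two-round distributed k-center sketch: each of m simulated machines
        runs the sequential greedy on its random partition, then the greedy
        is re-run on the union of the m*k local centers."""
        rng = np.random.default_rng(seed)
        parts = np.array_split(rng.permutation(len(U)), m)
        local = np.vstack([U[p][greedy_k_center(U[p], k)] for p in parts])
        return local[greedy_k_center(local, k)]

    def kcenter_cost(U, centers):
        """k-center objective: max distance from any point to its nearest center."""
        d = np.linalg.norm(U[:, None, :] - centers[None, :, :], axis=2)
        return float(d.min(axis=1).max())

    # Sweep mirroring the setup: fix k = 50 and vary m (z enters only in the
    # outlier variant, which is omitted here). Synthetic stand-in data.
    X = np.random.default_rng(0).standard_normal((10_000, 7))
    for m in (2, 5, 10, 20):
        print(m, round(kcenter_cost(X, distributed_k_center(X, k=50, m=m)), 3))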