K-means clustering using random matrix sparsification

Authors: Kaushik Sinha

ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our empirical studies on three real world datasets corroborate our theoretical findings and demonstrate that our proposed sparsification method can indeed achieve satisfactory clustering performance." "In this section we present empirical evaluation of our proposed algorithm on three real world datasets: USPS, RCV1 and TDT2."
Researcher Affiliation | Academia | "Department of Electrical Engineering & Computer Science, Wichita State University, KS, USA."
Pseudocode | Yes | "Algorithm 1: K-means clustering using random sparsification. Input: Data matrix A ∈ R^(n×d), number of clusters k, a positive scalar p and a γ-approximation k-means algorithm. Output: Cluster indicator matrix X_γ determining a k-partition of the rows of A."
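For concreteness, a minimal Python sketch of the pipeline described by Algorithm 1 follows. It assumes entries of A are kept independently with probability proportional to their squared magnitude and rescaled so the sparsified matrix is unbiased in expectation; the paper defines its own non-uniform sampling distribution, so the sparsify helper here is illustrative only, and scikit-learn's KMeans (k-means++ initialization plus Lloyd's heuristic) stands in for the γ-approximation k-means algorithm.

import numpy as np
from sklearn.cluster import KMeans

def sparsify(A, p, rng=None):
    # Keep entry (i, j) with probability p_ij and rescale kept entries by
    # 1/p_ij so the sparsified matrix equals A in expectation.
    # p_ij proportional to A_ij^2 is an assumption made for this sketch;
    # the paper specifies its own non-uniform sampling probabilities.
    rng = np.random.default_rng() if rng is None else rng
    probs = np.minimum(1.0, p * A.size * A**2 / (A**2).sum())
    mask = rng.random(A.shape) < probs
    S = np.zeros_like(A, dtype=float)
    S[mask] = A[mask] / probs[mask]
    return S

def sparse_kmeans(A, k, p):
    # Algorithm 1 in outline: sparsify A, then cluster its rows with a
    # gamma-approximation k-means solver (k-means++ plus Lloyd's here).
    S = sparsify(A, p)
    return KMeans(n_clusters=k, init="k-means++", n_init=30).fit_predict(S)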
Open Source Code | No | The paper does not provide any explicit statement or a link to open-source code for the methodology described.
Open Datasets | Yes | "In this section we present empirical evaluation of our proposed algorithm on three real world datasets: USPS, RCV1 and TDT2." "The USPS dataset (Hull, 1994)..." "The RCV1 dataset (Lewis et al., 2004)... We use a smaller subset of this dataset available from the LIBSVM webpage (LIB) containing 15,564 news articles from 53 categories." "The TDT2 dataset (Cieri et al., 1999)..." "LIBSVM Data: Multi-class (classification). URL https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html. Last accessed: 02-05-2018."
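Files in LIBSVM format from the page cited above can be read directly with scikit-learn; a minimal sketch follows, where the local file name is an assumption standing in for whichever file is downloaded from that page.

from sklearn.datasets import load_svmlight_file

# Load a LIBSVM-format dataset, e.g. the RCV1 multiclass subset downloaded
# from the URL cited above (the file name below is illustrative).
X, y = load_svmlight_file("rcv1_train.multiclass.bz2")
print(X.shape, len(set(y)))  # rows = articles, labels = news categories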
Dataset Splits | No | The paper performs clustering on the full datasets and evaluates clustering quality. It states: "We repeat this clustering 30 times, each time initializing cluster center using k-means++ and selecting the final clustering as the one with lowest k-means objective." This describes a procedure for finding a good local minimum of the k-means objective, not a standard train/validation/test split for model training or selection.
Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU/CPU models, processor types, memory amounts) used for running the experiments.
Software Dependencies | No | The paper mentions "To apply Lloyd's heuristic for k-means clustering we use Matlab's kmeans function" but does not provide specific version numbers for Matlab or the function itself, nor for any other software dependencies.
Experiment Setup | Yes | "To apply Lloyd's heuristic for k-means clustering we use Matlab's kmeans function which, by default, uses kmeans++ algorithm (Arthur & Vassilvitskii, 2007) for cluster center initialization. We repeat this clustering 30 times, each time initializing cluster center using k-means++ and selecting the final clustering as the one with lowest k-means objective. In our experiments, we vary p from 0.01 to 1.0 in steps of 0.01 and for each value of p, the number of non-zero entries in A obtained due to non-uniform sampling is denoted by q."
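The reported sweep is straightforward to reproduce in outline. A minimal sketch follows, assuming an already-loaded data matrix A and cluster count k, and reusing the illustrative sparsify helper sketched under the Pseudocode row; scikit-learn's n_init=30 replicates the 30 k-means++ restarts with the lowest-objective run kept.

import numpy as np
from sklearn.cluster import KMeans

# A: n x d data matrix, k: number of clusters (assumed already loaded);
# sparsify() is the illustrative helper from the Pseudocode row above.
for p in np.arange(0.01, 1.01, 0.01):
    S = sparsify(A, p)
    q = np.count_nonzero(S)  # q: non-zero entries retained at this p
    km = KMeans(n_clusters=k, init="k-means++", n_init=30).fit(S)
    # km.inertia_ is the k-means objective of the best of the 30 runs;
    # evaluate clustering quality (e.g. against category labels) here.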