K-means clustering using random matrix sparsification

Authors: Kaushik Sinha

ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our empirical studies on three real world datasets corroborate our theoretical findings and demonstrate that our proposed sparsification method can indeed achieve satisfactory clustering performance." "In this section we present empirical evaluation of our proposed algorithm on three real world datasets: USPS, RCV1 and TDT2."
Researcher Affiliation | Academia | "Department of Electrical Engineering & Computer Science, Wichita State University, KS, USA."
Pseudocode | Yes | "Algorithm 1: K-means clustering using random sparsification. Input: Data matrix A ∈ R^(n×d), number of clusters k, a positive scalar p and a γ-approximation k-means algorithm. Output: Cluster indicator matrix X_γ determining a k-partition of the rows of A."
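For concreteness, a minimal Python sketch of the pipeline described by Algorithm 1 follows. It assumes entries of A are kept independently with probability proportional to their squared magnitude and rescaled so the sparsified matrix is unbiased in expectation; the paper defines its own non-uniform sampling distribution, so the sparsify helper here is illustrative only, and scikit-learn's KMeans (k-means++ initialization plus Lloyd's heuristic) stands in for the γ-approximation k-means algorithm.

import numpy as np
from sklearn.cluster import KMeans

def sparsify(A, p, rng=None):
    # Keep entry (i, j) with probability p_ij and rescale kept entries by
    # 1/p_ij so the sparsified matrix equals A in expectation.
    # p_ij proportional to A_ij^2 is an assumption made for this sketch;
    # the paper specifies its own non-uniform sampling probabilities.
    rng = np.random.default_rng() if rng is None else rng
    probs = np.minimum(1.0, p * A.size * A**2 / (A**2).sum())
    mask = rng.random(A.shape) < probs
    S = np.zeros_like(A, dtype=float)
    S[mask] = A[mask] / probs[mask]
    return S

def sparse_kmeans(A, k, p):
    # Algorithm 1 in outline: sparsify A, then cluster its rows with a
    # gamma-approximation k-means solver (k-means++ plus Lloyd's here).
    S = sparsify(A, p)
    return KMeans(n_clusters=k, init="k-means++", n_init=30).fit_predict(S)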
Open Source Code | No | The paper does not provide any explicit statement or a link to open-source code for the methodology described.
Open Datasets | Yes | "In this section we present empirical evaluation of our proposed algorithm on three real world datasets: USPS, RCV1 and TDT2." "The USPS dataset (Hull, 1994)..." "The RCV1 dataset (Lewis et al., 2004)... We use a smaller subset of this dataset available from the LIBSVM webpage (LIB) containing 15,564 news articles from 53 categories." "The TDT2 dataset (Cieri et al., 1999)..." "LIBSVM Data: Multi-class (classification). URL https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html. Last accessed: 02-05-2018."
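Files in LIBSVM format from the page cited above can be read directly with scikit-learn; a minimal sketch follows, where the local file name is an assumption standing in for whichever file is downloaded from that page.

from sklearn.datasets import load_svmlight_file

# Load a LIBSVM-format dataset, e.g. the RCV1 multiclass subset downloaded
# from the URL cited above (the file name below is illustrative).
X, y = load_svmlight_file("rcv1_train.multiclass.bz2")
print(X.shape, len(set(y)))  # rows = articles, labels = news categories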
Dataset Splits | No | The paper performs clustering on the full datasets and evaluates clustering quality. It states: "We repeat this clustering 30 times, each time initializing cluster center using k-means++ and selecting the final clustering as the one with lowest k-means objective." This describes a procedure for finding a good local minimum of the k-means objective, not a standard train/validation/test split for model training or selection.
Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU/CPU models, processor types, memory amounts) used for running the experiments.
Software Dependencies | No | The paper mentions "To apply Lloyd's heuristic for k-means clustering we use Matlab's kmeans function" but does not provide specific version numbers for Matlab or the function itself, nor for any other software dependencies.
Experiment Setup | Yes | "To apply Lloyd's heuristic for k-means clustering we use Matlab's kmeans function which, by default, uses kmeans++ algorithm (Arthur & Vassilvitskii, 2007) for cluster center initialization. We repeat this clustering 30 times, each time initializing cluster center using k-means++ and selecting the final clustering as the one with lowest k-means objective. In our experiments, we vary p from 0.01 to 1.0 in steps of 0.01 and for each value of p, the number of non-zero entries in A obtained due to non-uniform sampling is denoted by q."
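The reported sweep is straightforward to reproduce in outline. A minimal sketch follows, assuming an already-loaded data matrix A and cluster count k, and reusing the illustrative sparsify helper sketched under the Pseudocode row; scikit-learn's n_init=30 replicates the 30 k-means++ restarts with the lowest-objective run kept.

import numpy as np
from sklearn.cluster import KMeans

# A: n x d data matrix, k: number of clusters (assumed already loaded);
# sparsify() is the illustrative helper from the Pseudocode row above.
for p in np.arange(0.01, 1.01, 0.01):
    S = sparsify(A, p)
    q = np.count_nonzero(S)  # q: non-zero entries retained at this p
    km = KMeans(n_clusters=k, init="k-means++", n_init=30).fit(S)
    # km.inertia_ is the k-means objective of the best of the 30 runs;
    # evaluate clustering quality (e.g. against category labels) here.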