Sparse Embedded $k$-Means Clustering

Authors: Weiwei Liu, Xiaobo Shen, Ivor Tsang

NeurIPS 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical studies corroborate our theoretical findings, and demonstrate that our approach is able to significantly accelerate k-means clustering, while achieving satisfactory clustering performance.
Researcher Affiliation | Academia | School of Computer Science and Engineering, The University of New South Wales; School of Computer Science and Engineering, Nanyang Technological University; Centre for Artificial Intelligence, University of Technology Sydney. {liuweiwei863,njust.shenxiaobo}@gmail.com; ivor.tsang@uts.edu.au
Pseudocode | Yes | Algorithm 1: Sparse Embedded k-Means Clustering.
  Input: X ∈ R^{n×d}; number of clusters k. Output: ε-approximate solution for problem (1).
  1: Set d̂ = O(max((k + log(1/δ))/ε², 6/(ε²δ))).
  2: Build a random map h so that for each i ∈ [d], h(i) = j for j ∈ [d̂] with probability 1/d̂.
  3: Construct the matrix Φ ∈ {0,1}^{d×d̂} with Φ_{i,h(i)} = 1 and all remaining entries 0.
  4: Construct the matrix Q ∈ R^{d×d}, a random diagonal matrix whose entries are i.i.d. Rademacher variables.
  5: Compute the product X̂ = XQΦ and run exact or approximate k-means algorithms on X̂.
Open Source Code | No | The paper mentions using code from websites for the baseline methods (LLE, LS, PD, k-means) but does not provide a link or statement about the availability of source code for its own proposed method.
Open Datasets | Yes | This section evaluates the performance of the proposed method on four real-world data sets: COIL20, SECTOR, RCV1 and ILSVRC2012. The COIL20 [20] and ILSVRC2012 [21] data sets are collected from websites [3][4], and the other data sets are collected from the LIBSVM website [5].
  [3] http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php
  [4] http://www.image-net.org/challenges/LSVRC/2012/
  [5] https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
Dataset Splits | No | The paper evaluates performance on several data sets but does not detail the training, validation, and test splits, percentages, or sample counts needed to reproduce the data partitioning.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running its experiments.
Software Dependencies | No | The paper mentions using a "standard k-means clustering package" and references code for the baselines, but does not provide version numbers for any ancillary software dependencies (e.g., libraries, frameworks) used in its implementation.
Experiment Setup | No | The paper states that the baseline methods were run "with default parameters" but does not specify concrete hyperparameters, training configurations, or system-level settings for its own proposed method or the overall experimental setup.
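The projection in the Algorithm 1 pseudocode above is a CountSketch-style sparse embedding, and it is simple to sketch in code. The snippet below is an illustrative reconstruction, not the authors' implementation: the paper only specifies d̂ up to O(·), so the concrete constant, the `min(d_hat, d)` cap, and the function name `sparse_embed` are assumptions made for this sketch.

```python
import numpy as np

def sparse_embed(X, k, eps=0.5, delta=0.1, seed=0):
    """Sketch of Algorithm 1's embedding step: map each of the d input
    columns of X to one of d_hat buckets (random map h) with a random
    Rademacher sign (diagonal of Q), i.e. compute X_hat = X Q Phi.
    The constant in d_hat is illustrative; the paper gives only O(.)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # d_hat = O(max((k + log(1/delta)) / eps^2, 6 / (eps^2 * delta)))
    d_hat = int(np.ceil(max((k + np.log(1.0 / delta)) / eps**2,
                            6.0 / (eps**2 * delta))))
    d_hat = min(d_hat, d)  # assumption: never project to a larger dimension
    h = rng.integers(0, d_hat, size=d)    # random map h: [d] -> [d_hat]
    q = rng.choice([-1.0, 1.0], size=d)   # i.i.d. Rademacher diagonal of Q
    X_hat = np.zeros((n, d_hat))
    for i in range(d):                    # accumulate signed columns per bucket
        X_hat[:, h[i]] += q[i] * X[:, i]
    return X_hat

# Usage: embed, then cluster the much lower-dimensional X_hat.
X = np.random.default_rng(1).standard_normal((50, 300))
X_hat = sparse_embed(X, k=3)
```

The resulting `X_hat` can then be passed to any exact or approximate k-means routine (e.g. scikit-learn's `KMeans`), which is where the speedup comes from: clustering runs in d̂ dimensions instead of d.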