Compressed K-Means for Large-Scale Clustering

Authors: Xiaobo Shen, Weiwei Liu, Ivor Tsang, Fumin Shen, Quan-Sen Sun

AAAI 2017

Reproducibility assessment (variable, result, and supporting LLM response):
Research Type: Experimental. "Extensive experimental results on four large-scale datasets, including two million-scale datasets, demonstrate that CKM outperforms the state-of-the-art large-scale clustering methods in terms of both computation and memory cost, while achieving comparable clustering accuracy."
Researcher Affiliation: Academia. Xiaobo Shen, Weiwei Liu, Ivor Tsang, Fumin Shen, Quan-Sen Sun. School of Computer Science and Engineering, Nanjing University of Science and Technology; Centre for Artificial Intelligence, University of Technology Sydney; School of Computer Science and Engineering, University of Electronic Science and Technology of China. Emails: {njust.shenxiaobo, liuweiwei863, fumin.shen}@gmail.com, ivor.tsang@uts.edu.au, sunquansen@njust.edu.cn
Pseudocode: Yes. The paper provides Algorithm 1, "Compressed k-means".
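The paper's Algorithm 1 is not reproduced in this report. As a rough illustration of the general idea of clustering in the compressed domain, the sketch below runs Lloyd-style k-means directly on binary codes, using Hamming distance and per-bit majority-vote centers. This is a generic sketch under those assumptions, not the authors' CKM objective; `hamming_kmeans` and the deterministic initialization are illustrative choices.

```python
import numpy as np

def hamming_kmeans(codes, k, iters=20):
    """Lloyd-style k-means over binary codes using Hamming distance.

    codes: (n, r) array of 0/1 bits. Centers stay binary via a
    per-bit majority vote, so every distance is a cheap bit count.
    Illustrative sketch only, not the paper's CKM formulation.
    """
    # Deterministic init: first k distinct codes (assumes >= k unique rows).
    centers = np.unique(codes, axis=0)[:k]
    assign = np.zeros(len(codes), dtype=int)
    for _ in range(iters):
        # Hamming distance = number of mismatching bits, shape (n, k).
        dist = (codes[:, None, :] != centers[None, :, :]).sum(axis=2)
        assign = dist.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = codes[assign == j]
            if len(members):
                # Per-bit majority vote keeps each center binary.
                new_centers[j] = (members.mean(axis=0) >= 0.5).astype(codes.dtype)
        if np.array_equal(new_centers, centers):
            break  # converged
        centers = new_centers
    return assign, centers
```

Because both points and centers are bit vectors, each assignment step costs O(n k r) bit comparisons rather than dense floating-point arithmetic, which is the source of the computation and memory savings the paper reports.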
Open Source Code: No. The paper mentions that code for the comparison methods (Nyström, LSC-K) is available online, but there is no explicit statement or link for the authors' own CKM implementation.
Open Datasets: Yes.
RCV1: a subset (Chen et al. 2011) of an archive of 804414 manually categorized newswire stories from Reuters Ltd., with 193844 documents in 103 categories. Following previous studies (Wang et al. 2011a), keywords (features) appearing fewer than 100 times in the corpus are removed, leaving 1979 (out of 47236) keywords in the experiment.
CovType: 581012 instances for predicting forest cover type from cartographic variables; each sample belongs to one of seven types (classes).
ILSVRC2012: a subset of ImageNet (Deng et al. 2009) containing 1000 object categories and more than 1.2 million images. As in (Lin, Shen, and van den Hengel 2015), images are represented by 4096-dimensional features extracted with the convolutional neural network (CNN) model of (Krizhevsky, Sutskever, and Hinton 2012).
MNIST8M: around 8.1 million images of handwritten digits from 0 to 9; the features are the same as for MNIST, 784-dimensional original pixel values.
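The RCV1 preprocessing quoted above (dropping keywords that appear fewer than 100 times in the corpus) amounts to a frequency cutoff on the columns of the term-count matrix. A minimal sketch, assuming a dense count matrix for brevity; `filter_rare_terms` and `min_df` are illustrative names, not from the paper:

```python
import numpy as np

def filter_rare_terms(counts, min_df=100):
    """Keep only keyword columns whose corpus-wide count reaches min_df.

    counts: (n_docs, n_terms) term-count matrix (dense here for brevity;
    a scipy.sparse matrix would be the realistic choice at RCV1 scale).
    Returns the filtered matrix and the indices of the kept columns.
    """
    term_totals = np.asarray(counts).sum(axis=0)  # occurrences per keyword
    keep = term_totals >= min_df                  # boolean column mask
    return counts[:, keep], np.flatnonzero(keep)
```

Applied to the full RCV1 corpus with `min_df=100`, a cutoff of this kind would reduce the 47236 keyword columns to the 1979 used in the paper's experiments.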
Dataset Splits: No. The paper mentions a training set X in Algorithm 1 and refers to a selected subset β, but does not specify explicit train/validation/test splits, percentages, or sample counts needed to reproduce the data partitioning for the experiments.
Hardware Specification: Yes. "All the computations reported in this study are performed on a Red Hat Enterprise 64-Bit Linux workstation with 18-core Intel Xeon CPU E5-2680 2.80 GHz and 256 GB memory."
Software Dependencies: No. The paper gives no version numbers for any software used in the CKM implementation; it only notes that Matlab versions of the comparison methods were used, without specifying which Matlab release.
Experiment Setup: Yes. For the binary-coding-based methods, the binary code length r is set to 32 for CovType and 128 for the other three high-dimensional datasets. For the proposed CKM, the ratio of the selected subset β is empirically set to 0.01, parameter α to 10, and ν to 1.