Adaptive Manifold Regularized Matrix Factorization for Data Clustering

Authors: Lefei Zhang, Qian Zhang, Bo Du, Jane You, Dacheng Tao

IJCAI 2017

Reproducibility Variable Result LLM Response
Research Type Experimental The experimental results on standard clustering datasets demonstrate superior performance over the existing alternatives. The data clustering experiments are conducted on ten publicly available benchmark datasets, including five image datasets (Caltech101 Silhouettes, COIL20, MSRA25, Palm Data25, and USPS) and five non-image datasets from the UCI machine learning repository (Control, Dermatology, Ecoli, Movement, and Seeds). For each dataset, the data is formatted as a feature matrix (i.e., input X of the AMRMF algorithm) of size number of features by number of samples, together with a ground truth vector of length number of samples. The statistics of the datasets used in the clustering experiments are summarized in Table 1. After the AMRMF algorithm terminates, the predicted label vector is obtained from the cluster indicator (i.e., output U of the AMRMF algorithm) and then compared with the ground truth vector for quantitative evaluation. The following two well-accepted measurements are used as metrics. Clustering accuracy (ACC), which discovers the one-to-one relationship between clusters and classes:

ACC = \frac{1}{n}\sum_{i=1}^{n} \delta(\mathrm{map}(r_i), l_i),

where r_i and l_i are the predicted and ground truth labels of sample x_i, respectively, \delta(x, y) is the delta function that equals 1 if x = y and 0 otherwise, and \mathrm{map}(r_i) is the permutation mapping function that maps each cluster r_i to the equivalent label from the dataset. Normalized mutual information (NMI), which measures the quality of clusters:

NMI = \frac{\sum_{i=1}^{k}\sum_{j=1}^{k} n_{i,j}\log\frac{n\,n_{i,j}}{n_i \hat{n}_j}}{\sqrt{\left(\sum_{i=1}^{k} n_i\log\frac{n_i}{n}\right)\left(\sum_{j=1}^{k} \hat{n}_j\log\frac{\hat{n}_j}{n}\right)}},

where n_i denotes the number of data points contained in cluster C_i (1 ≤ i ≤ k), \hat{n}_j is the number of data points belonging to class L_j (1 ≤ j ≤ k), and n_{i,j} denotes the number of data points in the intersection of C_i and L_j. Tables 2 and 3 summarize the clustering performance of each method on the ten datasets. We can see that the AMRMF algorithm outperforms the other clustering methods in most cases.
Finally, we would like to present the convergence performance of the proposed AMRMF optimization on the ten datasets, as illustrated in Figure 1.
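The ACC and NMI definitions quoted above can be sketched in plain Python. This is a minimal illustration, not the authors' code: the brute-force label matching assumes a small number of clusters (at the scale of, e.g., Caltech101's ~100 classes, a Hungarian-algorithm assignment would be used instead).

```python
import math
from itertools import permutations
from collections import Counter

def clustering_accuracy(pred, truth):
    """ACC: best one-to-one mapping between cluster ids and class labels.
    Brute-force over permutations -- only practical for small k."""
    clusters = sorted(set(pred))
    classes = sorted(set(truth))
    best = 0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))  # candidate map(r_i)
        hits = sum(1 for r, l in zip(pred, truth) if mapping[r] == l)
        best = max(best, hits)
    return best / len(pred)

def nmi(pred, truth):
    """NMI = I(pred; truth) / sqrt(H(pred) * H(truth))."""
    n = len(pred)
    cr, cl = Counter(pred), Counter(truth)       # n_i and n^_j
    joint = Counter(zip(pred, truth))            # n_{i,j}
    mi = sum(nij / n * math.log(n * nij / (cr[i] * cl[j]))
             for (i, j), nij in joint.items())
    hr = -sum(c / n * math.log(c / n) for c in cr.values())
    hl = -sum(c / n * math.log(c / n) for c in cl.values())
    return mi / math.sqrt(hr * hl) if hr > 0 and hl > 0 else 0.0

pred  = [0, 0, 1, 1, 2, 2]
truth = [1, 1, 0, 0, 2, 2]
print(clustering_accuracy(pred, truth))  # 1.0: perfect up to relabeling
print(nmi(pred, truth))                  # ~1.0: identical partitions
```

Note that ACC is invariant to how the clusters happen to be numbered, which is exactly why the permutation mapping function is needed.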
Researcher Affiliation Collaboration Lefei Zhang1, Qian Zhang2, Bo Du1, Jane You3, Dacheng Tao4. 1 School of Computer, Wuhan University; 2 Alibaba Group; 3 Department of Computing, The Hong Kong Polytechnic University; 4 UBTech Sydney AI Institute, The School of Information Technologies, The University of Sydney. zhanglefei@whu.edu.cn, qianzhang.zq@alibaba-inc.com, remoteking@whu.edu.cn, csyjia@comp.polyu.edu.hk, dacheng.tao@sydney.edu.au
Pseudocode No The paper describes the optimization steps for the proposed model (e.g., 'Update S', 'Update U', 'Update V', 'Update E', 'Update Z', 'Update ALM Parameters') but does not present them in a structured pseudocode or algorithm block (e.g., 'Algorithm 1').
Open Source Code No The paper mentions that codes for comparison methods (k-means, NCuts, LRR, LSR, RSS, SSC) were 'downloaded from the authors webpages', but it does not state that the code for their own proposed AMRMF method is open-source or publicly available.
Open Datasets Yes The data clustering experiments are conducted on ten publicly available benchmark datasets, including five image datasets (Caltech101 Silhouettes, COIL20, MSRA25, Palm Data25, and USPS) and five non-image datasets from the UCI machine learning repository (Control, Dermatology, Ecoli, Movement, and Seeds). The statistics of the datasets used in the clustering experiments are summarized in Table 1.
Dataset Splits No The paper describes using ground truth labels for quantitative evaluation of clustering performance on the datasets. However, it does not specify explicit train/validation/test splits, percentages, or methodology for partitioning the data into such subsets, which is typical for supervised learning tasks. Clustering is an unsupervised task, often evaluated on the entire dataset against ground truth.
Hardware Specification No The paper does not provide any specific details regarding the hardware used to conduct the experiments, such as CPU or GPU models, memory, or cloud computing specifications.
Software Dependencies Yes In our experiment, we compare the AMRMF algorithm with k-means, Normalized Cuts (NCuts) [Shi and Malik, 2000], low-rank representation (LRR) [Liu et al., 2013], least squares regression (LSR) [Lu et al., 2012], robust subspace segmentation (RSS) [Guo, 2015], and sparse subspace clustering (SSC) [Elhamifar and Vidal, 2013]. Among these, k-means is executed by the Matlab R2015b statistical toolbox, while the codes of the others are downloaded from the authors' webpages.
Experiment Setup Yes To obtain the best possible performance of the compared methods, the detailed experimental settings are as follows. For k-means and NCuts, we repeat the experiments ten times and report the average results with standard deviations. In particular, we tune the Gaussian kernel parameter in the range of 10^[-5:5] for NCuts. For the released codes of LRR, LSR, RSS, and SSC, since the authors have fixed the detailed parameters in the last step of k-means clustering (e.g., initialization, distance measure, and number of repeats), the algorithms output relatively stable results, so we do not need to average their results. In detail, for LRR, we have tried the two versions of LRR uploaded by the authors, with λ ∈ {0.001, 0.01, 0.02, 0.05, 0.1, 0.2, 1, 2, 5}, and report the best results. For LSR, we have also tested two implementations by the authors, with λ ∈ 2^[-5:5], to find the best performance. For RSS, we have tuned the three regularizer weights in the same range of 2^[-3:3], respectively, which is a much wider range than the authors recommended. For SSC, the parameter spaces are α ∈ 10^[-5:5] and ρ ∈ [1:5], respectively. Finally, for the proposed AMRMF algorithm, we have tuned the regularization parameters in the same range of 10^[-5:5]. Note that the proposed AMRMF algorithm outputs stable results under the suggested optimization steps, so we also do not need to average the reported results.
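The tuning protocol described above is essentially an exhaustive search over log-scale parameter grids. A minimal sketch follows; `run_clustering` is a hypothetical stand-in for any of the compared algorithms (assumed to return an accuracy-like score to maximize), not code from the paper.

```python
import itertools

def tune(run_clustering, grids):
    """Try every combination of parameter values and keep the best.

    grids maps parameter name -> list of candidate values, e.g. the
    10^[-5:5] range used for NCuts' kernel width and AMRMF's regularizers.
    """
    best_score, best_params = float("-inf"), None
    names = list(grids)
    for values in itertools.product(*(grids[n] for n in names)):
        params = dict(zip(names, values))
        score = run_clustering(**params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# Toy usage: an 11-point log grid, with a dummy score peaking at lam = 0.1.
grids = {"lam": [10.0 ** e for e in range(-5, 6)]}
params, score = tune(lambda lam: -abs(lam - 0.1), grids)
print(params)
```

With several parameters (e.g., RSS's three regularizer weights over 2^[-3:3]), the grid grows multiplicatively, which is why the quoted ranges are kept coarse.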