On Robustness of Kernel Clustering
Authors: Bowei Yan, Purnamrita Sarkar
NeurIPS 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5 (Experiments): In this section, we collect some numerical results. For the implementation of the proposed SDP, we use the Alternating Direction Method of Multipliers (ADMM) as used in [1]. In each synthetic experiment, we generate n − m inliers from r equal-sized clusters. The centers of the clusters are sparse and hidden in p-dimensional noise. To each generated data set we add m outlier observations. To capture the arbitrary nature of the outliers, we generate half of them from a Gaussian with large variance (100 times the signal variance) and the other half from a uniform distribution that scatters across all clusters. We compare Algorithm 1 with 1) k-means via Lloyd's algorithm; 2) kernel SVD; and 3) kernel PCA by [19]. The evaluation metric is inlier accuracy, i.e., the number of correctly clustered nodes divided by the total number of inliers. To avoid the identifiability problem, we compare all permutations of the predicted labels to the ground-truth labels and record the best accuracy. Each parameter setting is run for 10 replicates, and the mean accuracy and standard deviation (shown as error bars) are reported. For every k-means run in the experiments we use 10 restarts and keep the solution with the smallest k-means loss. For each experiment we vary one parameter and fix all others. Figure 1 shows how the performance of the clustering algorithms changes as (a) the number of clusters, (b) the number of outliers, and (c) the minimum distance between clusters increase; the parameter values used are specified in the figure caption. Panel (a) shows inlier accuracy as the number of clusters grows: with r growing, the performance of all methods deteriorates except for the SDP. We also examine the ℓ₁ norm of X₀ − X̂, which remains stable as the number of clusters increases. Panel (b) shows the trend with respect to the number of outliers: the SDP's inlier accuracy is almost unaffected, while the other methods suffer for large m. Panel (c) compares performance as the minimum distance between cluster centers changes: both SDP and K-SVD are consistent as the distance increases, but the SDP achieves consistency faster and varies less across random runs, matching the analysis in Section 3. (Hedged sketches of the data generation and the evaluation metric appear after this table.) |
| Researcher Affiliation | Academia | Bowei Yan, Department of Statistics and Data Sciences, University of Texas at Austin; Purnamrita Sarkar, Department of Statistics and Data Sciences, University of Texas at Austin |
| Pseudocode | Yes | Algorithm 1: SDP relaxation for kernel clustering; Algorithm 2: K-SVD (K-PCA, spectral clustering). (Hedged sketches of both algorithms follow this table.) |
| Open Source Code | No | The paper links to an arXiv preprint (https://arxiv.org/abs/1606.01869) for an extended version of the paper, but not to any source code. |
| Open Datasets | No | The paper describes generating synthetic data rather than using any publicly available datasets; no dataset download links or access information are provided. |
| Dataset Splits | No | The paper describes generating synthetic data for its experiments but does not mention train/validation/test splits. It evaluates "inlier accuracy" without giving any details on data partitioning that would aid reproducibility. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU/CPU models or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using "Alternating Direction Method of Multipliers that is used in [1]" but does not provide specific version numbers for any software or libraries used in their implementation. |
| Experiment Setup | Yes | In each synthetic experiment, we generate n − m inliers from r equal-sized clusters... For each generated data set, we add m outlier observations. To capture the arbitrary nature of the outliers, we generate half of them from a Gaussian with large variance (100 times the signal variance) and the other half from a uniform distribution that scatters across all clusters... For all k-means used in the experiments we do 10 restarts and choose the solution with the smallest k-means loss... The values of all parameters used are specified in the caption of the figure. Panel (a) shows the inlier accuracy for the various methods as the number of clusters increases: with r growing, the performance of all methods deteriorates except for the SDP. We also examine the ℓ₁ norm of X₀ − X̂, which remains stable as the number of clusters increases. Panel (b) shows the trend with respect to the number of outliers: the accuracy of the SDP on inliers is almost unaffected, while the other methods suffer for large m. Panel (c) compares performance as the minimum distance between cluster centers changes: both SDP and K-SVD are consistent as the distance increases, and compared with K-SVD, the SDP achieves consistency faster and varies less across random runs, matching the analysis in Section 3. |
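As a concrete illustration of the synthetic setup quoted above, here is a minimal sketch of the data-generation procedure in Python. The function name `generate_synthetic` and the parameters `k_sparse`, `dist`, and `noise_sd` are our own, not the paper's notation; the exact construction of the sparse centers is an assumption, since the paper only states that centers are sparse and hidden in p-dimensional noise.

```python
import numpy as np

def generate_synthetic(n, m, r, p, k_sparse, dist, noise_sd=1.0, seed=0):
    """Sketch of the paper's setup: n - m inliers from r equal-sized clusters
    with sparse centers hidden in p-dim noise, plus m outliers (half from a
    high-variance Gaussian, half uniform across the data range).
    Assumes r divides n - m so the clusters are exactly equal-sized."""
    rng = np.random.default_rng(seed)
    n_in = n - m
    assert n_in % r == 0, "equal-sized clusters require r | (n - m)"
    # Sparse centers: only k_sparse of the p coordinates are nonzero;
    # `dist` is a rough separation knob, not the paper's exact mechanism.
    centers = np.zeros((r, p))
    for j in range(r):
        support = rng.choice(p, size=k_sparse, replace=False)
        centers[j, support] = dist * rng.standard_normal(k_sparse)
    labels = np.repeat(np.arange(r), n_in // r)
    inliers = centers[labels] + noise_sd * rng.standard_normal((n_in, p))
    # Half the outliers: Gaussian with 100x the signal variance (sd = 10x).
    gauss = 10.0 * noise_sd * rng.standard_normal((m // 2, p))
    # Other half: uniform over the range spanned by the inliers.
    unif = rng.uniform(inliers.min(), inliers.max(), size=(m - m // 2, p))
    X = np.vstack([inliers, gauss, unif])
    y = np.concatenate([labels, -np.ones(m, dtype=int)])  # -1 marks outliers
    return X, y
```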
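The evaluation metric is permutation-matched accuracy on inliers only. A direct sketch of that metric, assuming outliers are marked with label -1 as in the generator above:

```python
from itertools import permutations
import numpy as np

def inlier_accuracy(y_true, y_pred, r):
    """Best accuracy over all r! relabelings of the predicted clusters,
    computed on inliers only (y_true == -1 marks outliers), mirroring the
    paper's metric. The brute-force scan is feasible for small r."""
    mask = y_true != -1
    yt, yp = y_true[mask], y_pred[mask]
    best = 0.0
    for perm in permutations(range(r)):
        relabeled = np.array([perm[c] for c in yp])
        best = max(best, float(np.mean(relabeled == yt)))
    return best
```

For larger r, the r! scan becomes infeasible; running `scipy.optimize.linear_sum_assignment` on the r×r confusion matrix recovers the same best matching in O(r³) time.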
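The paper provides pseudocode for Algorithm 1 but no implementation, and its exact constraint set for handling outliers is not reproduced here. The following sketch shows the standard Peng–Wei-style SDP relaxation of kernel k-means that such algorithms build on, solved with a generic cvxpy solver rather than the ADMM implementation the authors use; treat it as a sketch under those assumptions, not the paper's method.

```python
import cvxpy as cp
import numpy as np

def kernel_kmeans_sdp(K, r):
    """Peng-Wei-style SDP relaxation of kernel k-means:
        max <K, X>  s.t.  X PSD, X >= 0 entrywise, X 1 = 1, tr(X) = r.
    The paper's robust variant may use a different constraint set."""
    n = K.shape[0]
    X = cp.Variable((n, n), symmetric=True)
    cons = [X >> 0,                        # positive semidefinite
            X >= 0,                        # entrywise nonnegative
            X @ np.ones(n) == np.ones(n),  # each row sums to one
            cp.trace(X) == r]              # trace equals number of clusters
    cp.Problem(cp.Maximize(cp.sum(cp.multiply(K, X))), cons).solve()
    return X.value
```

A generic conic solver like this is only practical for small n, which is why the authors use ADMM for scalability. Cluster labels can then be recovered from the returned X̂, e.g., by running k-means on its top-r eigenvectors.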
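Finally, a hedged sketch of the K-SVD baseline (Algorithm 2): form a kernel matrix, take the top-r eigenvectors, and run k-means on their rows with the 10 restarts described in the experiments. The choice of a Gaussian kernel and the bandwidth parameter `sigma` are our assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def kernel_svd_cluster(X, r, sigma):
    """K-SVD-style baseline: Gaussian kernel, top-r eigenvectors,
    k-means on the rows of the eigenvector matrix (10 restarts)."""
    K = np.exp(-cdist(X, X, 'sqeuclidean') / (2 * sigma ** 2))
    w, V = np.linalg.eigh(K)   # eigenvalues in ascending order
    U = V[:, -r:]              # top-r eigenvectors
    return KMeans(n_clusters=r, n_init=10).fit_predict(U)
```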