On Robustness of Kernel Clustering
Authors: Bowei Yan, Purnamrita Sarkar
NeurIPS 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5 (Experiments): In this section, we collect some numerical results. For the implementation of the proposed SDP, we use the Alternating Direction Method of Multipliers (ADMM) as used in [1]. In each synthetic experiment, we generate n − m inliers from r equal-sized clusters. The centers of the clusters are sparse and hidden in p-dimensional noise. To each generated data set we add m outlier observations. To capture the arbitrary nature of the outliers, we generate half of them from a Gaussian with large variance (100 times the signal variance) and the other half from a uniform distribution that scatters across all clusters. We compare Algorithm 1 with 1) k-means via Lloyd's algorithm; 2) kernel SVD; and 3) kernel PCA by [19]. The evaluation metric is inlier accuracy, i.e., the number of correctly clustered nodes divided by the total number of inliers. To avoid the identifiability problem, we compare all permutations of the predicted labels to the ground-truth labels and record the best accuracy. Each parameter setting is run for 10 replicates, and the mean accuracy and standard deviation (shown as error bars) are reported. For every k-means run in the experiments we use 10 restarts and keep the solution with the smallest k-means loss. For each experiment we vary one parameter and fix all others. Figure 1 shows how the performance of the clustering algorithms changes as (a) the number of clusters, (b) the number of outliers, and (c) the minimum distance between clusters increase; the parameter values used are specified in the figure caption. Panel (a) shows inlier accuracy as the number of clusters grows: with r growing, the performance of all methods deteriorates except for the SDP. We also examine the ℓ₁ norm of X₀ − X̂, which remains stable as the number of clusters increases. Panel (b) shows the trend with respect to the number of outliers: the SDP's inlier accuracy is almost unaffected, while the other methods suffer for large m. Panel (c) compares performance as the minimum distance between cluster centers changes: both SDP and K-SVD are consistent as the distance increases, but the SDP achieves consistency faster and varies less across random runs, matching the analysis in Section 3. (Hedged sketches of the data generation and the evaluation metric appear after this table.) |
| Researcher Affiliation | Academia | Bowei Yan, Department of Statistics and Data Sciences, University of Texas at Austin; Purnamrita Sarkar, Department of Statistics and Data Sciences, University of Texas at Austin |
| Pseudocode | Yes | Algorithm 1: SDP relaxation for kernel clustering; Algorithm 2: K-SVD (K-PCA, spectral clustering). (Hedged sketches of both algorithms follow this table.) |
| Open Source Code | No | The paper links to an arXiv preprint (https://arxiv.org/abs/1606.01869) for an extended version of the paper, but not to any source code. |
| Open Datasets | No | The paper describes generating synthetic data rather than using any publicly available datasets; no dataset download links or access information are provided. |
| Dataset Splits | No | The paper describes generating synthetic data for its experiments but does not mention train/validation/test splits. It evaluates "inlier accuracy" without giving any details on data partitioning that would aid reproducibility. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU/CPU models or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using "Alternating Direction Method of Multipliers that is used in [1]" but does not provide specific version numbers for any software or libraries used in their implementation. |
| Experiment Setup | Yes | In each synthetic experiment, we generate n − m inliers from r equal-sized clusters... For each generated data set, we add m outlier observations. To capture the arbitrary nature of the outliers, we generate half of them from a Gaussian with large variance (100 times the signal variance) and the other half from a uniform distribution that scatters across all clusters... For all k-means used in the experiments we do 10 restarts and choose the solution with the smallest k-means loss... The values of all parameters used are specified in the caption of the figure. Panel (a) shows the inlier accuracy for the various methods as the number of clusters increases: with r growing, the performance of all methods deteriorates except for the SDP. We also examine the ℓ₁ norm of X₀ − X̂, which remains stable as the number of clusters increases. Panel (b) shows the trend with respect to the number of outliers: the accuracy of the SDP on inliers is almost unaffected, while the other methods suffer for large m. Panel (c) compares performance as the minimum distance between cluster centers changes: both SDP and K-SVD are consistent as the distance increases, and compared with K-SVD, the SDP achieves consistency faster and varies less across random runs, matching the analysis in Section 3. |
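As a concrete illustration of the synthetic setup quoted above, here is a minimal sketch of the data-generation procedure in Python. The function name `generate_synthetic` and the parameters `k_sparse`, `dist`, and `noise_sd` are our own, not the paper's notation; the exact construction of the sparse centers is an assumption, since the paper only states that centers are sparse and hidden in p-dimensional noise.

```python
import numpy as np

def generate_synthetic(n, m, r, p, k_sparse, dist, noise_sd=1.0, seed=0):
    """Sketch of the paper's setup: n - m inliers from r equal-sized clusters
    with sparse centers hidden in p-dim noise, plus m outliers (half from a
    high-variance Gaussian, half uniform across the data range).
    Assumes r divides n - m so the clusters are exactly equal-sized."""
    rng = np.random.default_rng(seed)
    n_in = n - m
    assert n_in % r == 0, "equal-sized clusters require r | (n - m)"
    # Sparse centers: only k_sparse of the p coordinates are nonzero;
    # `dist` is a rough separation knob, not the paper's exact mechanism.
    centers = np.zeros((r, p))
    for j in range(r):
        support = rng.choice(p, size=k_sparse, replace=False)
        centers[j, support] = dist * rng.standard_normal(k_sparse)
    labels = np.repeat(np.arange(r), n_in // r)
    inliers = centers[labels] + noise_sd * rng.standard_normal((n_in, p))
    # Half the outliers: Gaussian with 100x the signal variance (sd = 10x).
    gauss = 10.0 * noise_sd * rng.standard_normal((m // 2, p))
    # Other half: uniform over the range spanned by the inliers.
    unif = rng.uniform(inliers.min(), inliers.max(), size=(m - m // 2, p))
    X = np.vstack([inliers, gauss, unif])
    y = np.concatenate([labels, -np.ones(m, dtype=int)])  # -1 marks outliers
    return X, y
```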
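The evaluation metric is permutation-matched accuracy on inliers only. A direct sketch of that metric, assuming outliers are marked with label -1 as in the generator above:

```python
from itertools import permutations
import numpy as np

def inlier_accuracy(y_true, y_pred, r):
    """Best accuracy over all r! relabelings of the predicted clusters,
    computed on inliers only (y_true == -1 marks outliers), mirroring the
    paper's metric. The brute-force scan is feasible for small r."""
    mask = y_true != -1
    yt, yp = y_true[mask], y_pred[mask]
    best = 0.0
    for perm in permutations(range(r)):
        relabeled = np.array([perm[c] for c in yp])
        best = max(best, float(np.mean(relabeled == yt)))
    return best
```

For larger r, the r! scan becomes infeasible; running `scipy.optimize.linear_sum_assignment` on the r×r confusion matrix recovers the same best matching in O(r³) time.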
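The paper provides pseudocode for Algorithm 1 but no implementation, and its exact constraint set for handling outliers is not reproduced here. The following sketch shows the standard Peng–Wei-style SDP relaxation of kernel k-means that such algorithms build on, solved with a generic cvxpy solver rather than the ADMM implementation the authors use; treat it as a sketch under those assumptions, not the paper's method.

```python
import cvxpy as cp
import numpy as np

def kernel_kmeans_sdp(K, r):
    """Peng-Wei-style SDP relaxation of kernel k-means:
        max <K, X>  s.t.  X PSD, X >= 0 entrywise, X 1 = 1, tr(X) = r.
    The paper's robust variant may use a different constraint set."""
    n = K.shape[0]
    X = cp.Variable((n, n), symmetric=True)
    cons = [X >> 0,                        # positive semidefinite
            X >= 0,                        # entrywise nonnegative
            X @ np.ones(n) == np.ones(n),  # each row sums to one
            cp.trace(X) == r]              # trace equals number of clusters
    cp.Problem(cp.Maximize(cp.sum(cp.multiply(K, X))), cons).solve()
    return X.value
```

A generic conic solver like this is only practical for small n, which is why the authors use ADMM for scalability. Cluster labels can then be recovered from the returned X̂, e.g., by running k-means on its top-r eigenvectors.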
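Finally, a hedged sketch of the K-SVD baseline (Algorithm 2): form a kernel matrix, take the top-r eigenvectors, and run k-means on their rows with the 10 restarts described in the experiments. The choice of a Gaussian kernel and the bandwidth parameter `sigma` are our assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def kernel_svd_cluster(X, r, sigma):
    """K-SVD-style baseline: Gaussian kernel, top-r eigenvectors,
    k-means on the rows of the eigenvector matrix (10 restarts)."""
    K = np.exp(-cdist(X, X, 'sqeuclidean') / (2 * sigma ** 2))
    w, V = np.linalg.eigh(K)   # eigenvalues in ascending order
    U = V[:, -r:]              # top-r eigenvectors
    return KMeans(n_clusters=r, n_init=10).fit_predict(U)
```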