Discriminative Similarity for Data Clustering

Authors: Yingzhen Yang, Ping Li

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In order to evaluate the performance of the proposed discriminative similarity, we propose a new clustering method using a kernel as the similarity function, Clustering by Discriminative Similarity via unsupervised Kernel classification (CDSK), with its effectiveness demonstrated by experimental results.
Researcher Affiliation | Collaboration | Yingzhen Yang, School of Computing and Augmented Intelligence, Arizona State University, Tempe, AZ 85281, USA (yingzhen.yang@asu.edu); Ping Li, Cognitive Computing Lab, Baidu Research, Bellevue, WA 98004, USA (liping11@baidu.com). Yingzhen Yang's work was conducted as a consulting researcher at Baidu Research, Bellevue, WA, USA.
Pseudocode | Yes | Algorithm 1: Clustering by Discriminative Similarity via unsupervised Kernel classification (CDSK). (A generic, hedged sketch of this kind of kernel-similarity clustering pipeline follows the table.)
Open Source Code | No | The paper does not provide concrete access to source code (e.g., a specific repository link, an explicit code-release statement, or mention of code in supplementary materials) for the methodology described.
Open Datasets | Yes | Datasets. We conduct experiments on the Yale face dataset, the UCI Ionosphere dataset, the MNIST handwritten digits dataset, and the Georgia Face dataset. The Yale face dataset has face images of 15 people with 11 images per person. The Ionosphere data contains 351 points of dimensionality 34. The Georgia Face dataset contains images of 50 people, each represented by 15 color images with cluttered background. The COIL-20 dataset has 1440 images of size 32×32 for 20 objects with background removed in all images. The COIL-100 dataset contains 100 objects with 72 images of size 32×32 for each object. The CMU PIE face data contains 11554 cropped face images of size 32×32 for 68 persons, with around 170 facial images per person under different illuminations and expressions. The UMIST face dataset comprises 575 images of size 112×92 for 20 people. The CMU Multi-PIE (MPIE) data (Gross et al., 2010) contains 8916 facial images captured in four sessions. The MNIST handwritten digits database has a total of 70000 samples of dimensionality 1024 for digits from 0 to 9; the digits are normalized and centered in a fixed-size image. The Extended Yale Face Database B (Yale-B) dataset contains face images for 38 subjects, with 64 frontal face images taken under different illuminations for each subject. The CIFAR-10 dataset consists of 50000 training images and 10000 testing images in 10 classes, where each image is a 32×32 color image; we perform data clustering using all the training and testing images. We also use the mini-ImageNet dataset of Vinyals et al. (2016) to evaluate the potential of clustering methods. Mini-ImageNet consists of 60,000 color images of size 84×84 in 100 classes, with 600 images per class. Mini-ImageNet is known to be more complex than CIFAR-10, and we perform clustering on the 64 mini-ImageNet classes used for few-shot learning, so 38,400 images are used for clustering.
Dataset Splits | Yes | Following the practice in Mairal et al. (2012), we randomly sample 10% of the given data as the validation data, then perform CDSK on the validation data. The best λ is chosen among the discrete values in [0.05, 0.5] with a step of 0.05 which minimizes the average entropy of the embedding matrix $\mathbf{Y} \in \mathbb{R}^{n \times c}$ obtained by Algorithm 1. (A sketch of this selection procedure follows the table.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models or memory amounts) used for running its experiments.
Software Dependencies | No | The paper does not list ancillary software dependencies with version numbers.
Experiment Setup | Yes | M is set to 20 throughout all the experiments. In order to promote sparsity of α, α can be initialized by solving $\min_{\boldsymbol{\alpha}} \sum_{i=1}^{n} \big\| x_i - \sum_{j \neq i} x_j \alpha_j \big\|_2^2 + \tau \|\boldsymbol{\alpha}\|_0$ for a positive weighting parameter τ = 0.1. The kernel bandwidth in all methods is set to the variance of the pairwise Euclidean distances between the data. λ is the weight for the regularization term in the derived generalization bound and can be tuned by cross-validation (CV): the best λ is chosen among the discrete values in [0.05, 0.5] with a step of 0.05 which minimizes the average entropy of the embedding matrix $\mathbf{Y} \in \mathbb{R}^{n \times c}$ obtained by Algorithm 1. (Sketches of the bandwidth heuristic and the sparse initialization follow below.)
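As referenced in the Pseudocode row, the paper's Algorithm 1 is CDSK. The exact algorithm is not reproduced in this report; the following is only a minimal, hypothetical sketch of the generic pattern it fits, clustering with a Gaussian-kernel similarity matrix, with spectral clustering standing in for the paper's unsupervised kernel classification step. All function names here are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of a kernel-similarity clustering pipeline (NOT the
# authors' Algorithm 1; spectral clustering stands in for the paper's
# unsupervised kernel classification step).
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import SpectralClustering

def gaussian_kernel_similarity(X, bandwidth=None):
    """Gaussian-kernel similarity matrix. Per the Experiment Setup row,
    the bandwidth defaults to the variance of the pairwise Euclidean
    distances (its exact placement in the exponent is an assumption)."""
    d = pdist(X, metric="euclidean")       # condensed pairwise distances
    if bandwidth is None:
        bandwidth = np.var(d)              # paper's bandwidth heuristic
    D = squareform(d)                      # n x n distance matrix
    return np.exp(-(D ** 2) / (2.0 * bandwidth))

def cluster_by_kernel_similarity(X, n_clusters):
    """Cluster the data using the kernel similarity as the affinity."""
    K = gaussian_kernel_similarity(X)
    model = SpectralClustering(n_clusters=n_clusters, affinity="precomputed")
    return model.fit_predict(K)
```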
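The Dataset Splits row describes a concrete tuning procedure for λ: run CDSK on a 10% validation sample and pick the λ in [0.05, 0.5] (step 0.05) that minimizes the average entropy of the embedding $\mathbf{Y} \in \mathbb{R}^{n \times c}$. Below is a hedged sketch of that loop; `run_cdsk` is a hypothetical stand-in for Algorithm 1, and the rows of Y are assumed to be non-negative soft cluster assignments.

```python
# Hedged sketch of the cross-validation for lambda described above.
# `run_cdsk(X, lam)` is a hypothetical stand-in returning Y (n x c).
import numpy as np

def average_row_entropy(Y, eps=1e-12):
    """Mean Shannon entropy of the rows of Y, after normalizing each row
    into a probability distribution over the c clusters (assumes Y >= 0)."""
    P = Y / np.maximum(Y.sum(axis=1, keepdims=True), eps)
    return float(-np.mean(np.sum(P * np.log(P + eps), axis=1)))

def select_lambda(X, run_cdsk, seed=0):
    """Pick the lambda whose validation embedding has minimal entropy."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    val_idx = rng.choice(n, size=max(1, n // 10), replace=False)  # 10% split
    X_val = X[val_idx]
    candidates = np.arange(0.05, 0.50 + 1e-9, 0.05)  # [0.05, 0.5], step 0.05
    entropies = [average_row_entropy(run_cdsk(X_val, lam)) for lam in candidates]
    return float(candidates[int(np.argmin(entropies))])
```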
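The Experiment Setup row also mentions a sparsity-promoting initialization of α via ℓ0-regularized self-expressive least squares. Exact ℓ0 minimization is NP-hard; the sketch below uses Orthogonal Matching Pursuit as a common greedy surrogate (the paper does not name its solver, and OMP's sparsity level `n_nonzero_coefs` plays a role analogous to, but not identical with, the penalty weight τ = 0.1).

```python
# Hedged sketch of the sparse initialization of alpha: for each x_i,
# approximately solve  min_a ||x_i - sum_{j != i} x_j a_j||_2^2 + tau ||a||_0
# via Orthogonal Matching Pursuit (a greedy l0 surrogate; solver assumed).
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def init_alpha(X, n_nonzero=5):
    """Row i of the returned matrix holds the sparse coefficients that
    reconstruct x_i from the other samples (self-coefficient fixed at 0)."""
    n = X.shape[0]
    A = np.zeros((n, n))
    for i in range(n):
        idx = np.delete(np.arange(n), i)             # exclude x_i itself
        omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_nonzero)
        omp.fit(X[idx].T, X[i])                      # columns = other samples
        A[i, idx] = omp.coef_
    return A
```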