Wasserstein $K$-means for clustering probability distributions

Authors: Yubo Zhuang, Xiaohui Chen, Yun Yang

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our simulation and real data examples also demonstrate that distance-based K-means can achieve better classification performance over the standard centroid-based K-means for clustering probability distributions and images.
Researcher Affiliation | Academia | Yubo Zhuang, Xiaohui Chen, Yun Yang, Department of Statistics, University of Illinois at Urbana-Champaign, {yubo2,xhchen,yy84}@illinois.edu
Pseudocode | No | The paper describes algorithmic steps using equations, but does not present them in a structured pseudocode or algorithm block format.
Open Source Code | Yes | Codes using MATLAB and Python implementing W-SDP and D-WKM are available at: https://github.com/Yubo02/Wasserstein-K-means-for-clustering-probability-distributions.
Open Datasets | Yes | We consider three benchmark image datasets: MNIST, Fashion-MNIST and USPS handwriting digits.
Dataset Splits | Yes | We consider subsets of the whole datasets and randomly choose a fixed number of images from each cluster, based on 10 replicates for each case. Here we used the Sinkhorn divergence to calculate the Wasserstein distance and the Bregman projection with 100 iterations for computing the barycenters, which is efficient and stable in the non-degenerate case in practice. For both Wasserstein K-means methods, we use an initialization method analogous to K-means++ for Euclidean data, i.e., the first cluster barycenter is chosen uniformly at random from the distributions, after which each subsequent cluster barycenter is chosen from the remaining distributions with probability proportional to its squared Sinkhorn divergence from the distribution's closest existing cluster barycenter. ... For the MNIST dataset, we choose and fix two clusters: G_1 containing the number "0" and G_2 containing the number "5", for two cases. (1) In Case 1, we randomly draw 200 images of the number "0" and 100 images of the number "5" for each repetition. (2) In Case 2, we double the numbers and randomly draw 400 images of "0" and 200 images of "5" instead. For the Fashion-MNIST and USPS handwriting digits datasets, we consider K = 3: G_1 containing the "T-shirt/top" (or the number "0" for USPS handwriting), G_2 containing the "Trouser" (or the number "5" for USPS handwriting) and G_3 containing the "Dress" (or the number "7" for USPS handwriting). The cluster sizes are unbalanced: we randomly choose 200, 100 and 100 images from G_1, G_2 and G_3 respectively. (A minimal sketch of this initialization is given after the table.)
Hardware Specification | No | The paper does not provide any specific hardware details used for running the experiments.
Software Dependencies | No | The paper mentions 'MATLAB and Python' as the programming languages used, but does not provide specific version numbers for these or for any other software dependencies or libraries.
Experiment Setup | Yes | Here we used the Sinkhorn divergence to calculate the Wasserstein distance and the Bregman projection with 100 iterations for computing the barycenters, which is efficient and stable in the non-degenerate case in practice. For both Wasserstein K-means methods, we use an initialization method analogous to K-means++ for Euclidean data, i.e., the first cluster barycenter is chosen uniformly at random from the distributions, after which each subsequent cluster barycenter is chosen from the remaining distributions with probability proportional to its squared Sinkhorn divergence from the distribution's closest existing cluster barycenter. ... And we set the perturbation parameter t = 10^{-3} on the covariance matrix. (A sketch of the Bregman-projection barycenter step is given after the table.)
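
As referenced in the Dataset Splits row, the K-means++-style seeding with Sinkhorn divergences can be sketched as follows. This is a minimal sketch assuming the POT library (`ot`), with `ot.sinkhorn2` standing in for the Sinkhorn divergence on flattened image histograms over a common pixel grid; the function name `init_barycenters`, the regularization value `reg`, and the cost matrix are illustrative assumptions, not taken from the authors' released code.

```python
# Minimal sketch of the K-means++-style initialization with Sinkhorn divergences.
# Assumes the POT library (pip install pot); `distributions` is a list of 1-D
# histograms that each sum to 1, and `cost_matrix` is the ground cost between bins.
import numpy as np
import ot


def init_barycenters(distributions, cost_matrix, K, reg=1e-1, seed=None):
    """Pick K initial cluster barycenters: the first uniformly at random, each
    subsequent one with probability proportional to its squared Sinkhorn
    divergence from the closest barycenter chosen so far."""
    rng = np.random.default_rng(seed)
    n = len(distributions)
    centers = [int(rng.integers(n))]           # first center uniformly at random
    for _ in range(K - 1):
        # squared Sinkhorn cost from every distribution to its nearest center
        d2 = np.array([
            min(ot.sinkhorn2(distributions[i], distributions[c], cost_matrix, reg)
                for c in centers) ** 2
            for i in range(n)
        ])
        d2[centers] = 0.0                      # never re-pick an existing center
        centers.append(int(rng.choice(n, p=d2 / d2.sum())))
    return centers
```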
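
For the barycenter step in the Experiment Setup row, the following is a hedged sketch of computing an entropic Wasserstein barycenter by iterative Bregman projections with 100 iterations, using POT's `ot.bregman.barycenter`. The entropic regularization strength and the squared-Euclidean grid cost are illustrative assumptions, not values reported in the paper.

```python
# Hedged sketch: entropic Wasserstein barycenter of one cluster via iterative
# Bregman projections (100 iterations), assuming the POT library.
import numpy as np
import ot


def grid_cost(side):
    """Squared Euclidean ground cost between pixels of a side-by-side image grid."""
    coords = np.array([(i, j) for i in range(side) for j in range(side)], dtype=float)
    return ot.dist(coords, coords)             # squared Euclidean by default


def cluster_barycenter(histograms, cost_matrix, reg=1e-1, n_iter=100):
    """Barycenter of the normalized histograms in one cluster, stacked as
    columns as expected by ot.bregman.barycenter."""
    A = np.stack(histograms, axis=1)           # shape (n_bins, n_members)
    return ot.bregman.barycenter(A, cost_matrix, reg, numItermax=n_iter)
```

Applied to 28 x 28 MNIST images flattened into 784-bin histograms, `grid_cost(28)` would supply the cost matrix and `cluster_barycenter` would return the cluster's barycenter histogram; the alternating Lloyd-type assignment and barycenter updates are omitted here.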