Wasserstein $K$-means for clustering probability distributions
Authors: Yubo Zhuang, Xiaohui Chen, Yun Yang
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our simulation and real data examples also demonstrate that distance-based K-means can achieve better classification performance over the standard centroid-based K-means for clustering probability distributions and images. |
| Researcher Affiliation | Academia | Yubo Zhuang, Xiaohui Chen, Yun Yang Department of Statistics University of Illinois at Urbana-Champaign {yubo2,xhchen,yy84}@illinois.edu |
| Pseudocode | No | The paper describes algorithmic steps using equations, but does not present them in a structured pseudocode or algorithm block format. |
| Open Source Code | Yes | Codes using MATLAB and Python implementing W-SDP and D-WKM are available at: https://github.com/Yubo02/Wasserstein-K-means-for-clustering-probability-distributions. |
| Open Datasets | Yes | We consider three benchmark image datasets: MNIST, Fashion-MNIST and USPS handwriting digits. |
| Dataset Splits | Yes | We consider subsets of the whole datasets and randomly choose a fixed number of images from each cluster, based on 10 replicates for each case. Here we used the Sinkhorn divergence to calculate the Wasserstein distance and the Bregman projection with 100 iterations for computing the barycenters, which is efficient and stable for the non-degenerate case in practice. For both Wasserstein K-means methods, we use an initialization method analogous to K-means++ for Euclidean data, i.e., the first cluster barycenter is chosen uniformly at random as one of the distributions, after which each subsequent cluster barycenter is chosen from the remaining distributions with probability proportional to its squared Sinkhorn divergence from the distribution's closest existing cluster barycenter. ... For the MNIST dataset, we choose and fix two clusters: G1 containing the number "0" and G2 containing the number "5", for two cases. (1) In Case 1, we randomly draw 200 images of "0" and 100 images of "5" for each repetition. (2) In Case 2, we double these numbers and randomly draw 400 images of "0" and 200 images of "5" instead. For the Fashion-MNIST and USPS handwriting digits datasets, we consider K = 3: G1 containing the "T-shirt/top" (or the number "0" for USPS), G2 containing the "Trouser" (or the number "5" for USPS) and G3 containing the "Dress" (or the number "7" for USPS). The cluster sizes are unbalanced: we randomly choose 200, 100 and 100 images from G1, G2 and G3, respectively. |
| Hardware Specification | No | The paper does not provide any specific hardware details used for running the experiments. |
| Software Dependencies | No | The paper mentions 'MATLAB and Python' as programming languages used, but does not provide specific version numbers for these or any other software dependencies or libraries. |
| Experiment Setup | Yes | Here we used the Sinkhorn divergence to calculate the Wasserstein distance and the Bregman projection with 100 iterations for computing the barycenters, which is efficient and stable for the non-degenerate case in practice. For both Wasserstein K-means methods, we use an initialization method analogous to K-means++ for Euclidean data, i.e., the first cluster barycenter is chosen uniformly at random as one of the distributions, after which each subsequent cluster barycenter is chosen from the remaining distributions with probability proportional to its squared Sinkhorn divergence from the distribution's closest existing cluster barycenter. ... And we set the perturbation parameter $t = 10^{-3}$ on the covariance matrix. |
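
The Dataset Splits and Experiment Setup rows both quote the paper's K-means++-style seeding over distributions, where each new cluster barycenter is drawn with probability proportional to its squared Sinkhorn divergence from the nearest existing barycenter. The sketch below illustrates that seeding rule under some assumptions and is not taken from the released repository: it assumes the POT library, histograms supported on a common grid with ground-cost matrix `M`, the entropic OT cost `ot.sinkhorn2` as a stand-in for the Sinkhorn divergence, and an illustrative regularization value `reg`.

```python
# Sketch only: K-means++-style seeding over probability distributions.
# Each new seed is sampled with probability proportional to its squared
# (entropic) OT divergence from its closest already-chosen seed.
import numpy as np
import ot


def sinkhorn_kmeanspp_init(hists, M, K, reg=1e-2, seed=None):
    """hists: (n, d) normalized histograms (rows sum to 1); M: (d, d) ground cost."""
    rng = np.random.default_rng(seed)
    n = hists.shape[0]
    centers = [int(rng.integers(n))]        # first barycenter: one distribution, uniformly at random
    d2 = np.full(n, np.inf)                 # squared divergence to the nearest chosen seed
    for _ in range(1, K):
        c = hists[centers[-1]]
        for i in range(n):
            cost = float(ot.sinkhorn2(hists[i], c, M, reg))  # entropic OT cost between histograms
            d2[i] = min(d2[i], cost ** 2)
        d2[centers] = 0.0                   # already-chosen seeds cannot be redrawn
        centers.append(int(rng.choice(n, p=d2 / d2.sum())))  # sample proportional to squared divergence
    return centers
```

The quoted setup uses the (debiased) Sinkhorn divergence rather than the raw entropic cost; swapping in a debiased estimate would change only the distance call, not the seeding logic.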
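The Experiment Setup row also quotes the Bregman-projection barycenter (100 iterations) and the perturbation parameter $t$ on the covariance matrix. A minimal sketch of those two pieces follows, again assuming the POT library; the helper names `cluster_barycenter` and `perturb_cov`, and the reading of the perturbation as adding $t I$ to the covariance matrix, are assumptions rather than details confirmed by the quoted text.

```python
# Sketch only: cluster barycenter via iterative Bregman projections (100 iterations,
# as quoted), plus a diagonal perturbation keeping a covariance matrix non-degenerate.
import numpy as np
import ot


def cluster_barycenter(hists, M, reg=1e-2, n_iter=100):
    """hists: (m, d) histograms assigned to one cluster; returns the (d,) barycenter."""
    A = np.asarray(hists).T                 # POT expects one distribution per column
    return ot.bregman.barycenter(A, M, reg, numItermax=n_iter)


def perturb_cov(cov, t=1e-3):
    """Assumed form of the quoted perturbation: add t * I so the covariance stays positive definite."""
    return cov + t * np.eye(cov.shape[0])
```

`ot.bregman.barycenter` computes entropic Wasserstein barycenters by iterative Bregman projections, so capping `numItermax` at 100 mirrors the iteration count reported in the table.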