Interpretable Deep Clustering for Tabular Data

Authors: Jonathan Svirsky, Ofir Lindenbaum

ICML 2024

Reproducibility Variable Result LLM Response
Research Type Experimental We show that the proposed method can reliably predict cluster assignments in biological, text, image, and physics tabular datasets. Furthermore, using previously proposed metrics, we verify that our model leads to interpretable results at a sample and cluster level. Our code is available on Github.
1. Introduction
Clustering is a crucial task in data science that helps researchers uncover and study latent structures in complex data. By grouping related data points into clusters, researchers can gain insights into the underlying characteristics of the data and identify relationships between samples and variables. Clustering is used in various scientific fields, including biology (Reddy et al., 2018), physics (Mikuni & Canelli, 2021), and social sciences (Varghese et al., 2010). For instance, in biology, clustering can identify different disease subtypes based on molecular or genetic data. In psychology, based on survey data, clustering can identify different types of behavior or personality traits. Clustering is a common technique used in bio-medicine to analyze gene expression data. It involves identifying groups of genes that have similar expression patterns across different samples. Scientists often cluster high-dimensional points corresponding to individual cells to recover known cell populations and discover new, potentially rare cell types. However, bio-med gene expression data is generally represented in a tabular, high-dimensional format, making it difficult to obtain accurate clusters with meaningful structures.
In addition, interpretability is a crucial requirement for real-world bio-med datasets since it is essential to understand the biological meaning behind the identified clusters. As a result, there is an increasing demand in bio-medicine for clustering models that offer interpretability for tabular data.
* Equal contribution. (1) Department of Engineering, Bar Ilan University, Ramat-Gan, Israel. Correspondence to: Jonathan Svirsky <svirskj@biu.ac.il>, Ofir Lindenbaum <ofir.lindenbaum@biu.ac.il>. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).
Researcher Affiliation Academia Jonathan Svirsky (1), Ofir Lindenbaum (1). (1) Department of Engineering, Bar Ilan University, Ramat-Gan, Israel. Correspondence to: Jonathan Svirsky <svirskj@biu.ac.il>, Ofir Lindenbaum <ofir.lindenbaum@biu.ac.il>.
Pseudocode No The paper describes its method in prose and provides a high-level illustration in Figure 1, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or sections.
Open Source Code Yes Our code is available on Github.
Open Datasets Yes MNIST10K and MNIST60K are the subsets of MNIST (lec) dataset... Fashion MNIST60K is a train set of (git). The images include 10 categories of clothes. TOX-171 (Piloto & Schilling, 2010) dataset... ALLAML dataset (Golub et al., 1999)... PROSTATE dataset (Singh et al., 2002)... SRBCT dataset (Kar et al., 2015)... BIASE dataset (Biase et al., 2014)... INTESTINE (Sato et al., 2009)... PBMC-2 dataset is a binary-class subset of the original PBMC (Zheng et al., 2017)... CIFAR10 dataset (Krizhevsky et al., 2009)... cnae-9 (Bischl et al., 2017; Ciarelli & Oliveira, 2009)... MFEATZ is known as mfeat-zernike (van Breukelen et al., 1998; Bischl et al., 2017)... MiniBooNE is a physical dataset for particle identification task (Roe et al., 2005)... ALBERT is a text dataset from AutoML challenge (Guyon et al., 2019).
Dataset Splits No The paper mentions using 'MNIST test set split for evaluations' and refers to 'MNIST60K' and 'Fashion MNIST60K' as train sets. However, it does not specify explicit train/validation/test splits (e.g., percentages or absolute sample counts for each) for all datasets used, nor does it cite predefined validation splits. For example, for MNIST10K, it states 'with 1K images for each category' but gives no indication of how this dataset was split into training and validation sets for the experiments.
Hardware Specification Yes We implement our model in Pytorch and run experiments on Nvidia A100 GPU server with Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz.
Software Dependencies No The paper states, 'We implement our model in Pytorch', but it does not specify the version number for PyTorch or any other software libraries or dependencies used in the experiments.
Experiment Setup Yes Table 6. The number of epochs and batch size for different datasets.
Dataset | Epochs Stage 1 | Epochs Stage 2 | Batch size
Synthetic | 50 | 2000 | 800
MNIST60K | 300 | 600 | 256
MNIST10K | 300 | 700 | 100
Fashion MNIST | 100 | 500 | 256
TOX-171 | 1000 | 1000 | 16
ALLAML | 1000 | 1000 | 36
PROSTATE | 1000 | 1000 | 102
SRBCT | 2000 | 1000 | 83
BIASE | 10000 | 1000 | 56
INTESTINE | 5000 | 1000 | 238
PBMC-2 | 100 | 100 | 256
CNAE-9 | 1000 | 1000 | 500
MFEATZ | 1000 | 1000 | 500
MiniBooNE | 20 | 30 | 512
ALBERT | 10 | 40 | 1024
CIFAR-10 | 600 | 700 | 256
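For anyone re-running the experiments, the Table 6 schedule can be kept as a small lookup rather than retyped per dataset. The sketch below is illustrative only and is not taken from the authors' code: the names `Schedule`, `TABLE_6`, and `updates_per_stage` are hypothetical, and the update count assumes the final partial batch of each epoch is kept, which the paper does not state.

```python
import math
from typing import NamedTuple


class Schedule(NamedTuple):
    """One row of Table 6 (hypothetical container, not from the paper's code)."""
    epochs_stage1: int
    epochs_stage2: int
    batch_size: int


# Values copied from Table 6 of the paper.
TABLE_6 = {
    "Synthetic": Schedule(50, 2000, 800),
    "MNIST60K": Schedule(300, 600, 256),
    "MNIST10K": Schedule(300, 700, 100),
    "Fashion MNIST": Schedule(100, 500, 256),
    "TOX-171": Schedule(1000, 1000, 16),
    "ALLAML": Schedule(1000, 1000, 36),
    "PROSTATE": Schedule(1000, 1000, 102),
    "SRBCT": Schedule(2000, 1000, 83),
    "BIASE": Schedule(10000, 1000, 56),
    "INTESTINE": Schedule(5000, 1000, 238),
    "PBMC-2": Schedule(100, 100, 256),
    "CNAE-9": Schedule(1000, 1000, 500),
    "MFEATZ": Schedule(1000, 1000, 500),
    "MiniBooNE": Schedule(20, 30, 512),
    "ALBERT": Schedule(10, 40, 1024),
    "CIFAR-10": Schedule(600, 700, 256),
}


def updates_per_stage(dataset: str, n_samples: int) -> tuple[int, int]:
    """Return (stage 1, stage 2) gradient-update counts for a dataset.

    Assumption: the last, possibly smaller, batch of every epoch is kept,
    so each epoch takes ceil(n_samples / batch_size) updates.
    """
    schedule = TABLE_6[dataset]
    steps_per_epoch = math.ceil(n_samples / schedule.batch_size)
    return (
        schedule.epochs_stage1 * steps_per_epoch,
        schedule.epochs_stage2 * steps_per_epoch,
    )


# MNIST10K has 1K images for each of 10 categories, i.e. 10,000 samples.
print(updates_per_stage("MNIST10K", 10_000))  # (30000, 70000)
```

Keeping the schedule in one place makes it easy to see that the small bio-medical datasets (for example BIASE, with a batch size of 56) are trained for far more epochs than the large ones such as ALBERT.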