Clustering High Dimensional Categorical Data via Topographical Features

Authors: Chao Chen, Novi Quadrianto

ICML 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that our principled method outperforms state-of-the-art clustering methods while also admits an embarrassingly parallel property.
Researcher Affiliation | Academia | Chao Chen (CHAO.CHEN@QC.CUNY.EDU), CUNY Queens College & Graduate Center, New York, NY, USA; Novi Quadrianto (N.QUADRIANTO@SUSSEX.AC.UK), SMiLe CLiNiC, University of Sussex, Brighton, UK
Pseudocode | Yes | Algorithm 1 Discrete-Clustering; Algorithm 2 Compute-Next
Open Source Code | No | The paper does not provide an explicit statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | We use synthetic, UCI and biological datasets. See Table 1 for a summary of different datasets. UCI datasets. We use several categorical datasets from the UCI repository (Lichman, 2013)... Biological datasets. We use DNA barcoding datasets from (Kuksa & Pavlovic, 2009).
Dataset Splits | No | The paper does not provide specific details on training, validation, or test dataset splits. It only mentions providing the 'true number of clusters to K-Means, K-Modes and mixture models' for competing methods.
Hardware Specification | No | The paper mentions running times but does not specify any hardware details (e.g., CPU, GPU models, or memory specifications) used for the experiments.
Software Dependencies | No | The paper mentions using the 'pyMix package (Georgi et al., 2010)' and other algorithms/methods but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | The only parameter we need is the scale parameter δ. Empirically, we observe δ = 1 is the best choice, although δ = 2 and δ = 3 also work well. For methods that depend on initialization, we run five times and report the mean score. To ensure TMode finishes in a reasonable amount of time, we restrict the tree degree to no more than eight during model estimation and use this degree-restricted tree for TMode method.
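
The "run five times and report the mean score" protocol quoted under Experiment Setup can be sketched roughly as follows. This is a minimal illustration, not the authors' code: `mean_score_over_runs` and `toy_run` are hypothetical names, using the run index as the random seed is an assumption, and `toy_run` is a stand-in that returns a made-up score rather than a real clustering result.

```python
import random
from statistics import mean
from typing import Callable


def mean_score_over_runs(run_once: Callable[[int], float], n_runs: int = 5) -> float:
    """Call an initialization-dependent method n_runs times and average its scores.

    run_once takes a seed and returns one evaluation score. Seeding with the
    run index is an assumption made here for repeatability.
    """
    scores = [run_once(seed) for seed in range(n_runs)]
    return mean(scores)


def toy_run(seed: int) -> float:
    """Hypothetical stand-in: a real version would cluster the data with this
    seed and score the result against the ground-truth labels."""
    rng = random.Random(seed)
    return 0.8 + 0.1 * rng.random()  # placeholder score in [0.8, 0.9)


if __name__ == "__main__":
    print(f"mean score over 5 runs: {mean_score_over_runs(toy_run):.3f}")
```

In practice `toy_run` would be replaced by a function that fits one of the initialization-dependent baselines (for example K-Means or K-Modes) with the given seed and returns whichever evaluation score is being reported.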