Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Clustering High Dimensional Categorical Data via Topographical Features
Authors: Chao Chen, Novi Quadrianto
ICML 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that our principled method outperforms state-of-the-art clustering methods while also admits an embarrassingly parallel property. |
| Researcher Affiliation | Academia | Chao Chen EMAIL CUNY Queens College & Graduate Center, New York, NY, USA; Novi Quadrianto EMAIL SMiLe CLiNiC, University of Sussex, Brighton, UK |
| Pseudocode | Yes | Algorithm 1 Discrete-Clustering; Algorithm 2 Compute-Next |
| Open Source Code | No | The paper does not provide an explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | We use synthetic, UCI and biological datasets. See Table 1 for a summary of different datasets. UCI datasets. We use several categorical datasets from the UCI repository (Lichman, 2013)... Biological datasets. We use DNA barcoding datasets from (Kuksa & Pavlovic, 2009). |
| Dataset Splits | No | The paper does not provide specific details on training, validation, or test dataset splits. It only mentions providing the 'true number of clusters to K-Means, K-Modes and mixture models' for competing methods. |
| Hardware Specification | No | The paper mentions running times but does not specify any hardware details (e.g., CPU, GPU models, or memory specifications) used for the experiments. |
| Software Dependencies | No | The paper mentions using the 'pyMix package (Georgi et al., 2010)' and other algorithms/methods but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | The only parameter we need is the scale parameter δ. Empirically, we observe δ = 1 is the best choice, although δ = 2 and δ = 3 also work well. For methods that depend on initialization, we run five times and report the mean score. To ensure TMode finishes in a reasonable amount of time, we restrict the tree degree to no more than eight during model estimation and use this degree-restricted tree for TMode method. |
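
The Experiment Setup row states that initialization-dependent methods are run five times and the mean score is reported. A minimal sketch of that protocol follows. The names `mean_score_over_runs` and `toy_method` and the choice of seeds 0 to 4 are illustrative assumptions, not taken from the paper, and the toy scorer stands in for an actual clustering run.

```python
import random
import statistics
from typing import Callable, Sequence


def mean_score_over_runs(
    run_once: Callable[[int], float],
    seeds: Sequence[int] = (0, 1, 2, 3, 4),  # five runs; the seed values are an assumption
) -> float:
    """Run an initialization-dependent method once per seed and average the scores."""
    scores = [run_once(seed) for seed in seeds]
    return statistics.mean(scores)


def toy_method(seed: int) -> float:
    # Hypothetical stand-in for one clustering run: returns a made-up score in [0, 1).
    return random.Random(seed).random()


mean_score = mean_score_over_runs(toy_method)
print(f"mean score over 5 runs: {mean_score:.3f}")
```

In practice `run_once` would fit a method such as K-Means or K-Modes with the given seed and return its evaluation score; which score the paper averages is not specified in this summary.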