Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

The Impact of Random Models on Clustering Similarity

Authors: Alexander J. Gates, Yong-Yeol Ahn

JMLR 2017 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental The consequences of different random models are illustrated using synthetic examples, handwriting recognition, and gene expression data. We demonstrate that the choice of random model can have a drastic impact on the ranking of similar clustering pairs, and the evaluation of a clustering method with respect to a random baseline; thus, the choice of random clustering model should be carefully justified. In Section 6, these tasks are demonstrated in the context of several examples: a synthetic clustering example, K-means clustering of a handwritten digits data set (MNIST), and an evaluation of hierarchical clustering applied to gene expression data.
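The paper's central object, an adjusted similarity score relative to a random model, can be made concrete with a minimal pure-Python sketch of the Adjusted Rand Index under the permutation model (Mperm). This is an illustrative reimplementation, not the authors' clusim code; the function name is our own:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """ARI under the permutation model (Mperm): the raw pair-counting
    index is shifted and scaled by its expected value over random
    relabelings that preserve the cluster sizes of both clusterings."""
    n = len(labels_a)
    # Contingency counts n_ij between the two clusterings.
    contingency = Counter(zip(labels_a, labels_b))
    index = sum(comb(c, 2) for c in contingency.values())
    # Marginal pair counts for each clustering separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)  # E[index] under Mperm
    max_index = (sum_a + sum_b) / 2
    return (index - expected) / (max_index - expected)
```

Identical clusterings score 1, and two maximally "crossed" two-cluster labelings of four elements score -0.5. Swapping Mperm for a different random ensemble changes only the `expected` term, which is precisely the dependence the paper examines.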
Researcher Affiliation Academia Alexander J. Gates (EMAIL), Yong-Yeol Ahn (EMAIL); Department of Informatics and Program in Cognitive Science, Indiana University, 919 East 10th Street, Bloomington, IN 47408, USA
Pseudocode No The paper provides mathematical derivations and descriptions of algorithms (like preferential attachment model), but it does not include any clearly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code Yes A package to compute the adjusted similarity measures is available on the authors' GitHub: https://github.com/ajgates42/clusim
Open Datasets Yes K-means clustering of a handwritten digits data set (MNIST) (Alimoglu and Alpaydin, 1996, see Appendix B.1 for details). Appendix B.1 Digits Data Set: The digits data set is bundled with the scikit-learn source code and consists of 1,797 images of 8×8 gray-level pixels of handwritten digits. The data set was originally assembled in Alimoglu and Alpaydin (1996). We illustrate the dependence of the adjusted similarity baseline on the choice of random model using a gene expression data set. Specifically, we use a collection of 35 cancer gene expression studies assembled in de Souto et al. (2008). Appendix B.2 Gene Expression Data Set: The data was assembled in de Souto et al. (2008) and is freely available from http://bioinformatics.rutgers.edu/Publications/deSouto2008c/index.html.
Dataset Splits No The paper uses the MNIST digits dataset and a gene expression dataset. For the MNIST dataset, it mentions a 'ground truth clustering corresponding to the digit' and applies K-means. For the gene expression data, it refers to '35 cancer gene expression studies' and compares derived clusterings to a 'reference clustering'. Neither case specifies conventional training, validation, or test splits with percentages or sample counts for reproducing experiments involving model training or evaluation splits.
Hardware Specification No The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used for running the experiments.
Software Dependencies No The paper mentions 'sci-kit learn source code' and 'mpmath arbitrary precision library for Python developed by Johansson et al. (2013)'. While the mpmath reference includes '(version 0.18)', this is the library's general version, not explicitly stated as the version used in their experimental setup. No other software dependencies are mentioned with specific version numbers relevant to replicating their experiments.
Experiment Setup Yes We demonstrate the importance of the random ensemble assumption through a comparison of the clusterings uncovered by 400 runs of K-means on a collection of hand-written digits... The K-means clustering method groups elements so as to minimize the average (Euclidean) distance from the cluster centroid. In most scenarios, it uncovers clusterings with a pre-specified number of clusters (K). For our example, the digits naturally fall into 10 disjoint clusters... The Adjusted Rand index assuming Mperm is shown on the x-axis; positive scores (blue and pink points) denote the method performed better than the random baseline... Clusterings are identified via agglomerative hierarchical clustering using correlation to compute the average linkage between data points, a common clustering methodology in biology. Since hierarchical clustering produces a clustering with the user specified number of clusters...
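The K-means experiment described above can be sketched with scikit-learn (assuming it is installed; three runs here stand in for the paper's 400), scoring each run against the ground-truth digit labels. Note that scikit-learn's `adjusted_rand_score` fixes the permutation model (Mperm) as the baseline, the very choice the paper argues should be justified:

```python
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# 1,797 images of 8x8 handwritten digits, with the digit as ground truth.
X, y_true = load_digits(return_X_y=True)

# Several K-means runs with different seeds (the paper uses 400 runs, K = 10).
runs = [
    KMeans(n_clusters=10, n_init=10, random_state=seed).fit_predict(X)
    for seed in range(3)
]

# Positive scores mean the run beat the Mperm random baseline.
scores = [adjusted_rand_score(y_true, labels) for labels in runs]
```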