Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

The Impact of Random Models on Clustering Similarity

Authors: Alexander J. Gates, Yong-Yeol Ahn

JMLR 2017 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental The consequences of different random models are illustrated using synthetic examples, handwriting recognition, and gene expression data. We demonstrate that the choice of random model can have a drastic impact on the ranking of similar clustering pairs, and the evaluation of a clustering method with respect to a random baseline; thus, the choice of random clustering model should be carefully justified. In Section 6, these tasks are demonstrated in the context of several examples: a synthetic clustering example, K-means clustering of a handwritten digits data set (MNIST), and an evaluation of hierarchical clustering applied to gene expression data.
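The paper's central object, an adjusted similarity score relative to a random model, can be made concrete with a minimal pure-Python sketch of the Adjusted Rand Index under the permutation model (Mperm). This is an illustrative reimplementation, not the authors' clusim code; the function name is our own:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """ARI under the permutation model (Mperm): the raw pair-counting
    index is shifted and scaled by its expected value over random
    relabelings that preserve the cluster sizes of both clusterings."""
    n = len(labels_a)
    # Contingency counts n_ij between the two clusterings.
    contingency = Counter(zip(labels_a, labels_b))
    index = sum(comb(c, 2) for c in contingency.values())
    # Marginal pair counts for each clustering separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)  # E[index] under Mperm
    max_index = (sum_a + sum_b) / 2
    return (index - expected) / (max_index - expected)
```

Identical clusterings score 1, and two maximally "crossed" two-cluster labelings of four elements score -0.5. Swapping Mperm for a different random ensemble changes only the `expected` term, which is precisely the dependence the paper examines.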
Researcher Affiliation Academia Alexander J. Gates (EMAIL), Yong-Yeol Ahn (EMAIL); Department of Informatics and Program in Cognitive Science, Indiana University, 919 East 10th Street, Bloomington, IN 47408, USA
Pseudocode No The paper provides mathematical derivations and descriptions of algorithms (like preferential attachment model), but it does not include any clearly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code Yes A package to compute the adjusted similarity measures is available on the authors' GitHub: https://github.com/ajgates42/clusim
Open Datasets Yes K-means clustering of a handwritten digits data set (MNIST) (Alimoglu and Alpaydin, 1996, see Appendix B.1 for details). Appendix B.1 Digits Data Set: The digits data set is bundled with the scikit-learn source code and consists of 1,797 images of 8×8 gray-level pixels of handwritten digits. The data set was originally assembled in Alimoglu and Alpaydin (1996). We illustrate the dependence of the adjusted similarity baseline on the choice of random model using a gene expression data set. Specifically, we use a collection of 35 cancer gene expression studies assembled in de Souto et al. (2008). Appendix B.2 Gene Expression Data Set: The data was assembled in de Souto et al. (2008) and is freely available from http://bioinformatics.rutgers.edu/Publications/deSouto2008c/index.html.
Dataset Splits No The paper uses the MNIST digits dataset and a gene expression dataset. For the MNIST dataset, it mentions a 'ground truth clustering corresponding to the digit' and applies K-means. For the gene expression data, it refers to '35 cancer gene expression studies' and compares derived clusterings to a 'reference clustering'. Neither case specifies conventional training, validation, or test splits with percentages or sample counts for reproducing experiments involving model training or evaluation splits.
Hardware Specification No The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used for running the experiments.
Software Dependencies No The paper mentions 'sci-kit learn source code' and 'mpmath arbitrary precision library for Python developed by Johansson et al. (2013)'. While the mpmath reference includes '(version 0.18)', this is the library's general version, not explicitly stated as the version used in their experimental setup. No other software dependencies are mentioned with specific version numbers relevant to replicating their experiments.
Experiment Setup Yes We demonstrate the importance of the random ensemble assumption through a comparison of the clusterings uncovered by 400 runs of K-means on a collection of hand-written digits... The K-means clustering method groups elements so as to minimize the average (Euclidean) distance from the cluster centroid. In most scenarios, it uncovers clusterings with a pre-specified number of clusters (K). For our example, the digits naturally fall into 10 disjoint clusters... The Adjusted Rand index assuming Mperm is shown on the x-axis; positive scores (blue and pink points) denote the method performed better than the random baseline... Clusterings are identified via agglomerative hierarchical clustering using correlation to compute the average linkage between data points, a common clustering methodology in biology. Since hierarchical clustering produces a clustering with the user specified number of clusters...
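The K-means experiment described above can be sketched with scikit-learn (assuming it is installed; three runs here stand in for the paper's 400), scoring each run against the ground-truth digit labels. Note that scikit-learn's `adjusted_rand_score` fixes the permutation model (Mperm) as the baseline, the very choice the paper argues should be justified:

```python
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# 1,797 images of 8x8 handwritten digits, with the digit as ground truth.
X, y_true = load_digits(return_X_y=True)

# Several K-means runs with different seeds (the paper uses 400 runs, K = 10).
runs = [
    KMeans(n_clusters=10, n_init=10, random_state=seed).fit_predict(X)
    for seed in range(3)
]

# Positive scores mean the run beat the Mperm random baseline.
scores = [adjusted_rand_score(y_true, labels) for labels in runs]
```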