Von Mises-Fisher Clustering Models

Authors: Siddharth Gopal, Yiming Yang

ICML 2014

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments on six datasets provide strong empirical support in favour of vMF based clustering models over other popular tools such as K-means, Multinomial Mixtures and Latent Dirichlet Allocation."
Researcher Affiliation | Academia | Siddharth Gopal (SGOPAL1@CS.CMU.EDU), Carnegie Mellon University, Pittsburgh, PA 15213, USA; Yiming Yang (YIMING@CS.CMU.EDU), Carnegie Mellon University, Pittsburgh, PA 15213, USA
Pseudocode | No | The paper describes its inference schemes (EM, variational inference, collapsed Gibbs sampling) but does not present them in structured pseudocode or algorithm blocks. (A sketch of the standard movMF EM updates is given below the table.)
Open Source Code | No | No explicit statement or link providing access to source code for the described methods was found.
Open Datasets | Yes | "Throughout our experiments we used several popular benchmark datasets (Table 1): TDT-{4,5} (Allan et al., 1998), CNAE, K9, NEWS20, and NIPS (Globerson et al., 2007)." Footnoted links: CNAE: http://archive.ics.uci.edu/ml/datasets/CNAE-9; K9: http://www-users.cs.umn.edu/boley/ftp/PDDPdata/; NEWS20: http://people.csail.mit.edu/jrennie/20Newsgroups/. (An illustrative CNAE-9 loader is sketched below the table.)
Dataset Splits | No | Table 1 provides "#Training" and "#Testing" counts for each dataset, but no explicit information about a separate validation split or about how hyperparameters were tuned, if applicable. (A conventional way to carve out a validation split is sketched below the table.)
Hardware Specification | Yes | "All our experiments were run on 48 core AMD Opteron 6168 @ 1.92 GHz with 66GB RAM with full parallelization wherever possible."
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as libraries or programming languages used in the experiments.
Experiment Setup | Yes | "For all the methods, every instance is assigned to the cluster with largest probability before evaluation. Also, for the sake of a fair comparison, we set the same random initial values of the cluster-assignment variables in all the algorithms; the results of each method are averaged over 10 different starting values. [...] We used the Tf-Idf normalized data representation for vMFmix, KM and GM, and feature counts representation (without normalization) for MM. [...] For simplicity, we fix the number of clusters to 30." (The full protocol is sketched below the table.)
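
Since the paper presents its inference schemes only in prose, the following is a minimal sketch of the textbook EM updates for a mixture of von Mises-Fisher distributions (closed-form mean-direction update, the Banerjee et al. (2005) approximation for the concentration κ). It is our reconstruction, not the authors' implementation; all function names are ours, and rows of X are assumed to be L2-normalized (e.g. unit-norm Tf-Idf vectors).

```python
import numpy as np
from scipy.special import ive


def log_vmf_normalizer(kappa, d):
    # log C_d(kappa), where the vMF density is C_d(kappa) * exp(kappa * mu^T x).
    # ive is the exponentially scaled Bessel function: log I_v(k) = log ive(v, k) + k.
    v = d / 2.0 - 1.0
    return v * np.log(kappa) - (d / 2.0) * np.log(2.0 * np.pi) \
        - (np.log(ive(v, kappa)) + kappa)


def movmf_em(X, K, n_iter=50, seed=0):
    # X: (n, d) array with unit-norm rows. Returns mixing weights, means, kappas.
    n, d = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(n, size=K, replace=False)]   # initialize means from data points
    kappa = np.full(K, float(d))                   # keep kappa ~ d so ive does not underflow
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: log responsibilities, normalized per row.
        log_p = np.log(pi) + log_vmf_normalizer(kappa, d) + kappa * (X @ mu.T)
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: closed-form mean direction; Banerjee et al. approximation for kappa.
        Nk = r.sum(axis=0)                         # (K,) soft cluster counts
        rho = r.T @ X                              # (K, d) weighted resultant vectors
        rho_norm = np.linalg.norm(rho, axis=1)
        mu = rho / rho_norm[:, None]
        rbar = np.clip(rho_norm / Nk, 1e-8, 1.0 - 1e-8)
        kappa = rbar * (d - rbar ** 2) / (1.0 - rbar ** 2)
        pi = Nk / n
    return pi, mu, kappa
```

A production version would also guard against empty clusters and against ive underflowing when an estimated kappa is far smaller than the dimensionality; this sketch omits both.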
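
The footnotes link only to dataset landing pages. As one illustration, CNAE-9 (1080 documents, 856 word-frequency features, with the class label 1-9 in the first column) could be loaded as below; the direct file URL is our guess at the UCI mirror layout, not something stated in the paper.

```python
import pandas as pd

# Assumed direct file location under the UCI repository; the paper's footnote
# links only to http://archive.ics.uci.edu/ml/datasets/CNAE-9.
CNAE9_URL = ("https://archive.ics.uci.edu/ml/"
             "machine-learning-databases/00233/CNAE-9.data")


def load_cnae9(path=CNAE9_URL):
    # Column 0 is the class label (1-9); columns 1-856 are word frequencies.
    df = pd.read_csv(path, header=None)
    return df.iloc[:, 1:].to_numpy(dtype=float), df.iloc[:, 0].to_numpy()
```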
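
Because only training and testing counts are reported, a reproducer who needs a validation set for hyperparameter tuning would have to carve one from the training portion. The helper below is entirely hypothetical (the paper specifies no validation split); the 10% fraction and stratification are our choices.

```python
from sklearn.model_selection import train_test_split


def make_validation_split(X_train, y_train, frac=0.1, seed=0):
    # Hold out `frac` of the paper's training portion for validation;
    # the fraction and the stratified sampling are our assumptions.
    return train_test_split(X_train, y_train, test_size=frac,
                            random_state=seed, stratify=y_train)
```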
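
Finally, the quoted experiment setup translates into a protocol of roughly the following shape. This is a sketch only: scikit-learn's KMeans stands in for each competing method, and normalized mutual information is an illustrative metric (the quoted text does not name the paper's metrics). The shared per-restart seed mimics "the same random initial values of the cluster-assignment variables in all the algorithms".

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import normalized_mutual_info_score
from sklearn.preprocessing import normalize


def run_protocol(docs, labels, K=30, n_restarts=10):
    # Tf-Idf rows, L2-normalized onto the unit sphere (the representation
    # the paper uses for vMFmix, KM and GM).
    X = normalize(TfidfVectorizer().fit_transform(docs))
    scores = []
    for seed in range(n_restarts):
        # Each competing method would reuse the same seed so that all
        # algorithms start from identical initial cluster assignments;
        # KMeans here is only a placeholder for each method.
        pred = KMeans(n_clusters=K, n_init=1, random_state=seed).fit_predict(X)
        scores.append(normalized_mutual_info_score(labels, pred))
    # Hard assignment to the most probable cluster is implicit in
    # fit_predict; scores are averaged over the restarts.
    return float(np.mean(scores))
```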