Von Mises-Fisher Clustering Models

Authors: Siddharth Gopal, Yiming Yang

ICML 2014

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments on six datasets provide strong empirical support in favour of vMF based clustering models over other popular tools such as K-means, Multinomial Mixtures and Latent Dirichlet Allocation."
Researcher Affiliation | Academia | Siddharth Gopal (SGOPAL1@CS.CMU.EDU), Carnegie Mellon University, Pittsburgh, PA 15213, USA; Yiming Yang (YIMING@CS.CMU.EDU), Carnegie Mellon University, Pittsburgh, PA 15213, USA
Pseudocode | No | The paper describes its inference schemes (EM, variational inference, collapsed Gibbs sampling) but does not present them in structured pseudocode or algorithm blocks. (A sketch of the standard movMF EM updates is given below the table.)
Open Source Code | No | No explicit statement or link providing access to source code for the described methods was found.
Open Datasets | Yes | "Throughout our experiments we used several popular benchmark datasets (Table 1): TDT-{4,5} (Allan et al., 1998), CNAE, K9, NEWS20, and NIPS (Globerson et al., 2007)." Footnoted links: CNAE: http://archive.ics.uci.edu/ml/datasets/CNAE-9; K9: http://www-users.cs.umn.edu/boley/ftp/PDDPdata/; NEWS20: http://people.csail.mit.edu/jrennie/20Newsgroups/. (An illustrative CNAE-9 loader is sketched below the table.)
Dataset Splits | No | Table 1 provides "#Training" and "#Testing" counts for each dataset, but no explicit information about a separate validation split or about how hyperparameters were tuned, if applicable. (A conventional way to carve out a validation split is sketched below the table.)
Hardware Specification | Yes | "All our experiments were run on 48 core AMD Opteron 6168 @ 1.92 GHz with 66GB RAM with full parallelization wherever possible."
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as libraries or programming languages used in the experiments.
Experiment Setup | Yes | "For all the methods, every instance is assigned to the cluster with largest probability before evaluation. Also, for the sake of a fair comparison, we set the same random initial values of the cluster-assignment variables in all the algorithms; the results of each method are averaged over 10 different starting values. [...] We used the Tf-Idf normalized data representation for vMFmix, KM and GM, and feature counts representation (without normalization) for MM. [...] For simplicity, we fix the number of clusters to 30." (The full protocol is sketched below the table.)
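
Since the paper presents its inference schemes only in prose, the following is a minimal sketch of the textbook EM updates for a mixture of von Mises-Fisher distributions (closed-form mean-direction update, the Banerjee et al. (2005) approximation for the concentration κ). It is our reconstruction, not the authors' implementation; all function names are ours, and rows of X are assumed to be L2-normalized (e.g. unit-norm Tf-Idf vectors).

```python
import numpy as np
from scipy.special import ive


def log_vmf_normalizer(kappa, d):
    # log C_d(kappa), where the vMF density is C_d(kappa) * exp(kappa * mu^T x).
    # ive is the exponentially scaled Bessel function: log I_v(k) = log ive(v, k) + k.
    v = d / 2.0 - 1.0
    return v * np.log(kappa) - (d / 2.0) * np.log(2.0 * np.pi) \
        - (np.log(ive(v, kappa)) + kappa)


def movmf_em(X, K, n_iter=50, seed=0):
    # X: (n, d) array with unit-norm rows. Returns mixing weights, means, kappas.
    n, d = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(n, size=K, replace=False)]   # initialize means from data points
    kappa = np.full(K, float(d))                   # keep kappa ~ d so ive does not underflow
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: log responsibilities, normalized per row.
        log_p = np.log(pi) + log_vmf_normalizer(kappa, d) + kappa * (X @ mu.T)
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: closed-form mean direction; Banerjee et al. approximation for kappa.
        Nk = r.sum(axis=0)                         # (K,) soft cluster counts
        rho = r.T @ X                              # (K, d) weighted resultant vectors
        rho_norm = np.linalg.norm(rho, axis=1)
        mu = rho / rho_norm[:, None]
        rbar = np.clip(rho_norm / Nk, 1e-8, 1.0 - 1e-8)
        kappa = rbar * (d - rbar ** 2) / (1.0 - rbar ** 2)
        pi = Nk / n
    return pi, mu, kappa
```

A production version would also guard against empty clusters and against ive underflowing when an estimated kappa is far smaller than the dimensionality; this sketch omits both.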
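
The footnotes link only to dataset landing pages. As one illustration, CNAE-9 (1080 documents, 856 word-frequency features, with the class label 1-9 in the first column) could be loaded as below; the direct file URL is our guess at the UCI mirror layout, not something stated in the paper.

```python
import pandas as pd

# Assumed direct file location under the UCI repository; the paper's footnote
# links only to http://archive.ics.uci.edu/ml/datasets/CNAE-9.
CNAE9_URL = ("https://archive.ics.uci.edu/ml/"
             "machine-learning-databases/00233/CNAE-9.data")


def load_cnae9(path=CNAE9_URL):
    # Column 0 is the class label (1-9); columns 1-856 are word frequencies.
    df = pd.read_csv(path, header=None)
    return df.iloc[:, 1:].to_numpy(dtype=float), df.iloc[:, 0].to_numpy()
```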
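
Because only training and testing counts are reported, a reproducer who needs a validation set for hyperparameter tuning would have to carve one from the training portion. The helper below is entirely hypothetical (the paper specifies no validation split); the 10% fraction and stratification are our choices.

```python
from sklearn.model_selection import train_test_split


def make_validation_split(X_train, y_train, frac=0.1, seed=0):
    # Hold out `frac` of the paper's training portion for validation;
    # the fraction and the stratified sampling are our assumptions.
    return train_test_split(X_train, y_train, test_size=frac,
                            random_state=seed, stratify=y_train)
```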
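
Finally, the quoted experiment setup translates into a protocol of roughly the following shape. This is a sketch only: scikit-learn's KMeans stands in for each competing method, and normalized mutual information is an illustrative metric (the quoted text does not name the paper's metrics). The shared per-restart seed mimics "the same random initial values of the cluster-assignment variables in all the algorithms".

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import normalized_mutual_info_score
from sklearn.preprocessing import normalize


def run_protocol(docs, labels, K=30, n_restarts=10):
    # Tf-Idf rows, L2-normalized onto the unit sphere (the representation
    # the paper uses for vMFmix, KM and GM).
    X = normalize(TfidfVectorizer().fit_transform(docs))
    scores = []
    for seed in range(n_restarts):
        # Each competing method would reuse the same seed so that all
        # algorithms start from identical initial cluster assignments;
        # KMeans here is only a placeholder for each method.
        pred = KMeans(n_clusters=K, n_init=1, random_state=seed).fit_predict(X)
        scores.append(normalized_mutual_info_score(labels, pred))
    # Hard assignment to the most probable cluster is implicit in
    # fit_predict; scores are averaged over the restarts.
    return float(np.mean(scores))
```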