Von Mises-Fisher Clustering Models
Authors: Siddharth Gopal, Yiming Yang
ICML 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on six datasets provide strong empirical support in favour of vMF based clustering models over other popular tools such as K-means, Multinomial Mixtures and Latent Dirichlet Allocation. |
| Researcher Affiliation | Academia | Siddharth Gopal (SGOPAL1@CS.CMU.EDU), Carnegie Mellon University, Pittsburgh, PA 15213 USA; Yiming Yang (YIMING@CS.CMU.EDU), Carnegie Mellon University, Pittsburgh, PA 15213 USA |
| Pseudocode | No | The paper describes its inference schemes (EM, variational inference, collapsed Gibbs sampling) in prose but does not present them in structured pseudocode or algorithm blocks; a hedged EM sketch is given after this table. |
| Open Source Code | No | No explicit statement or link providing access to open-source code for the described methodology was found. |
| Open Datasets | Yes | Throughout our experiments we used several popular benchmark datasets (Table 1): TDT-{4,5} (Allan et al., 1998), CNAE, K9, NEWS20, and NIPS (Globerson et al., 2007). Footnotes provide dataset links: http://archive.ics.uci.edu/ml/datasets/CNAE-9 (CNAE), http://www-users.cs.umn.edu/boley/ftp/PDDPdata/ (K9), http://people.csail.mit.edu/jrennie/20Newsgroups/ (NEWS20). |
| Dataset Splits | No | Table 1 provides '#Training' and '#Testing' counts for each dataset, but there is no explicit mention of a separate validation split, nor of how hyperparameters were tuned, if at all. |
| Hardware Specification | Yes | All our experiments were run on a 48-core AMD Opteron 6168 @ 1.92 GHz with 66 GB RAM, with full parallelization wherever possible. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as libraries or programming languages used in the experiments. |
| Experiment Setup | Yes | For all the methods, every instance is assigned to the cluster with the largest probability before evaluation. Also, for the sake of a fair comparison, we set the same random initial values of the cluster-assignment variables in all the algorithms; the results of each method are averaged over 10 different starting values. [...] We used the Tf-Idf normalized data representation for vMFmix, KM and GM, and the feature-counts representation (without normalization) for MM. [...] For simplicity, we fix the number of clusters to 30. A protocol sketch follows this table. |
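
Since the paper presents its inference schemes only in prose, the following is a minimal NumPy sketch of standard EM for a mixture of von Mises-Fisher distributions, for orientation only. It uses the κ-update approximation of Banerjee et al. (2005) and a leading-order asymptotic expansion for the Bessel term; it is not the paper's Bayesian variational or collapsed Gibbs schemes, and all names (`vmf_mixture_em`, `log_vmf_const`) are ours.

```python
import numpy as np


def log_vmf_const(kappa, d):
    """log of the vMF normalizer C_d(kappa) = kappa^(d/2-1) /
    ((2*pi)^(d/2) * I_{d/2-1}(kappa)).  log I_v uses the leading term of
    the uniform asymptotic expansion (Abramowitz & Stegun 9.7.7), which
    stays finite in the high dimensions where scipy's iv/ive under- or
    overflow; it assumes d is reasonably large."""
    v = d / 2.0 - 1.0
    z = kappa / v
    sq = np.sqrt(1.0 + z * z)
    log_iv = (v * (sq + np.log(z) - np.log1p(sq))
              - 0.5 * np.log(2.0 * np.pi * v) - 0.25 * np.log(1.0 + z * z))
    return v * np.log(kappa) - (d / 2.0) * np.log(2.0 * np.pi) - log_iv


def vmf_mixture_em(X, K, n_iter=50, seed=0):
    """EM for a mixture of von Mises-Fisher distributions.
    X: (n, d) array with L2-normalized rows. Returns (means, kappas, weights)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, size=K, replace=False)]  # random points as initial means
    kappa = np.full(K, float(d))  # concentration scales roughly with dimension
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities, normalized in the log domain for stability
        log_p = (X @ mu.T) * kappa + log_vmf_const(kappa, d) + np.log(pi)
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted resultant vector per cluster
        Nk = np.maximum(r.sum(axis=0), 1e-12)
        rho = r.T @ X
        norms = np.maximum(np.linalg.norm(rho, axis=1), 1e-12)
        mu = rho / norms[:, None]
        rbar = np.clip(norms / Nk, 1e-8, 1.0 - 1e-8)  # mean resultant length
        kappa = (rbar * d - rbar ** 3) / (1.0 - rbar ** 2)  # Banerjee et al. approx.
        pi = Nk / n
    return mu, kappa, pi
```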
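To make the quoted setup concrete, here is a hedged sketch of the evaluation protocol, reusing the hypothetical `vmf_mixture_em` and `log_vmf_const` above: Tf-Idf features (L2-normalized, since vMF models require unit vectors), 30 clusters, hard assignment of each instance to its highest-probability cluster, repeated over 10 random starts. scikit-learn's 20 Newsgroups loader stands in for the paper's own copies of the corpora, and `max_features` is capped purely to keep the demo small.

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

# NEWS20 stand-in: scikit-learn's 20 Newsgroups loader; the cap on
# max_features only keeps this demo small in memory.
docs = fetch_20newsgroups(subset="train").data
X = TfidfVectorizer(norm="l2", max_features=2000).fit_transform(docs).toarray()

K = 30  # the paper fixes the number of clusters to 30
runs = []
for seed in range(10):  # results averaged over 10 different starting values
    mu, kappa, pi = vmf_mixture_em(X, K, seed=seed)
    # hard assignment: each instance goes to its highest-probability cluster
    log_p = (X @ mu.T) * kappa + log_vmf_const(kappa, X.shape[1]) + np.log(pi)
    runs.append(log_p.argmax(axis=1))
# cluster-quality scores (e.g., NMI against the newsgroup labels) would be
# computed for each run and then averaged across the 10 seeds
```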