Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Online Trans-dimensional von Mises-Fisher Mixture Models for User Profiles
Authors: Xiangju Qin, Pádraig Cunningham, Michael Salter-Townshend
JMLR 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results on simulated and real-world data show that the proposed TvMF mixture models can discover more interpretable and intuitive clusters than other widely used models, such as k-means, non-negative matrix factorization (NMF), Dirichlet process Gaussian mixture models (DP-GMM), and dynamic topic models (DTM). We further evaluate the performance of the proposed models in real-world applications, such as a churn prediction task, which shows the usefulness of the generated features. |
| Researcher Affiliation | Academia | Xiangju Qin EMAIL, Pádraig Cunningham EMAIL, School of Computer Science, University College Dublin, Belfield, Dublin 4, Ireland; Michael Salter-Townshend EMAIL, Department of Statistics, University of Oxford, 24-29 St Giles, Oxford, OX1 3LB, UK |
| Pseudocode | Yes | Algorithm 1: Collapsed Gibbs sampling for TvMFMM ... Algorithm 2: Parallel sampling for latent variables Z |
| Open Source Code | No | The paper mentions using the 'movMF software provided by Banerjee et al. (2005)' and implementations 'in the scikit-learn package' for comparison, but does not provide a statement or link for the source code of their own proposed methodology. |
| Open Datasets | Yes | We used the movMF software provided by Banerjee et al. (2005) to generate synthetic data with: a) 4 well-separated components; b) 5 well-separated components; c) 7 not well-separated components. ... For this purpose, we parsed the May 2014 dump of English Wikipedia, collected the edit activity of all registered users, then aggregated the edit activity of each user on a quarterly basis. |
| Dataset Splits | Yes | Each of the synthetic datasets has a training size of 10000 and a held-out test size of 2500. ... we randomly selected 20% of users who stayed active for only one quarter, included these users and those who were active for at least two quarters in our training dataset; the remaining 80% of short-term users were used as the held-out dataset. |
| Hardware Specification | Yes | All our experiments were run on a 32-core AMD Opteron 6134 @ 2.25 GHz with 252 GB RAM. |
| Software Dependencies | No | The paper mentions 'Python', the 'multiprocessing' package, the 'movMF software provided by Banerjee et al. (2005)', and 'the scikit-learn package', but does not specify version numbers for any of these software components. |
| Experiment Setup | Yes | The prior parameters for all the vMF mixtures are {α, µ0, C0, m, σ2}. ... We set α = 1.0 for all models. The prior parameter µ0 is estimated using empirical Bayes. ... we set C0 = 2.0 for all the vMF mixtures. ... We found that overall, the run with 10 clusters generates more coherent clusters, so we set H = 10 for parametric models. ... Three independent chains (each chain running 20000 iterations) were used to diagnose the convergence of OTvMFMMs... |
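For readers unfamiliar with the model family assessed above: a von Mises-Fisher (vMF) mixture clusters unit-norm vectors on the hypersphere, with each component parameterized by a mean direction µ and concentration κ. The paper's own inference is collapsed Gibbs sampling (Algorithm 1), which is not reproduced here; the sketch below only illustrates the vMF log-density and a single soft-assignment (E-step-style) computation for a mixture. All function names and parameter values are illustrative, not from the paper.

```python
import numpy as np
from scipy.special import ive  # exponentially scaled modified Bessel function

def vmf_logpdf(x, mu, kappa):
    """Log-density of a von Mises-Fisher distribution on the unit sphere S^{d-1}.

    x, mu : unit-norm vectors of dimension d; kappa : concentration > 0.
    Uses ive (= iv(v, z) * exp(-z)) so the normalizer is stable for large kappa.
    """
    d = mu.shape[0]
    log_norm = ((d / 2 - 1) * np.log(kappa)
                - (d / 2) * np.log(2 * np.pi)
                - (np.log(ive(d / 2 - 1, kappa)) + kappa))
    return kappa * (x @ mu) + log_norm

def responsibilities(x, mus, kappas, weights):
    """Posterior probability that x belongs to each mixture component
    (the soft-assignment step of a vMF mixture)."""
    logp = np.array([np.log(w) + vmf_logpdf(x, m, k)
                     for m, k, w in zip(mus, kappas, weights)])
    logp -= logp.max()          # log-sum-exp trick for numerical stability
    p = np.exp(logp)
    return p / p.sum()

# Hypothetical two-component mixture on S^2 with orthogonal mean directions.
mus = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
r = responsibilities(np.array([1.0, 0.0, 0.0]), mus,
                     kappas=[10.0, 10.0], weights=[0.5, 0.5])
print(r)  # a point aligned with mu_1 is assigned almost entirely to component 1
```

With equal weights and concentrations, the log-odds between the two components reduce to κ(x·µ1 − x·µ2), so a point on µ1 is favored by a factor of e^κ; this is the same geometry that drives the cluster assignments in the paper's Gibbs sampler.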