Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Online Trans-dimensional von Mises-Fisher Mixture Models for User Profiles
Authors: Xiangju Qin, Pádraig Cunningham, Michael Salter-Townshend
JMLR 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results on simulated and real-world data show that the proposed TvMF mixture models can discover more interpretable and intuitive clusters than other widely used models, such as k-means, non-negative matrix factorization (NMF), Dirichlet process Gaussian mixture models (DP-GMM), and dynamic topic models (DTM). We further evaluate the performance of the proposed models in real-world applications, such as a churn prediction task, which shows the usefulness of the generated features. |
| Researcher Affiliation | Academia | Xiangju Qin EMAIL, Pádraig Cunningham EMAIL, School of Computer Science, University College Dublin, Belfield, Dublin 4, Ireland; Michael Salter-Townshend EMAIL, Department of Statistics, University of Oxford, 24-29 St Giles, Oxford, OX1 3LB, UK |
| Pseudocode | Yes | Algorithm 1: Collapsed Gibbs sampling for TvMFMM ... Algorithm 2: Parallel sampling for latent variables Z |
| Open Source Code | No | The paper mentions using the 'movMF software provided by Banerjee et al. (2005)' and implementations 'in the scikit-learn package' for comparison, but does not provide a statement or link for the source code of their own proposed methodology. |
| Open Datasets | Yes | We used the movMF software provided by Banerjee et al. (2005) to generate synthetic data with: a) 4 well-separated components; b) 5 well-separated components; c) 7 not well-separated components. ... For this purpose, we parsed the May 2014 dump of English Wikipedia, collected the edit activity of all registered users, then aggregated the edit activity of each user on a quarterly basis. |
| Dataset Splits | Yes | Each of the synthetic datasets has a training size of 10000 and a held-out test size of 2500. ... we randomly selected 20% of users who stayed active for only one quarter, included these users and those who were active for at least two quarters in our training dataset; the remaining 80% of short-term users were used as the held-out dataset. |
| Hardware Specification | Yes | All our experiments were run on a 32-core AMD Opteron 6134 @ 2.25 GHz with 252 GB RAM. |
| Software Dependencies | No | The paper mentions 'Python', the 'multiprocessing' package, the 'movMF software provided by Banerjee et al. (2005)', and 'the scikit-learn package', but does not specify version numbers for any of these software components. |
| Experiment Setup | Yes | The prior parameters for all the vMF mixtures are {α, µ0, C0, m, σ2}. ... We set α = 1.0 for all models. The prior parameter µ0 is estimated using empirical Bayes. ... we set C0 = 2.0 for all the vMF mixtures. ... We found that overall, the run with 10 clusters generates more coherent clusters, so we set H = 10 for parametric models. ... Three independent chains (each chain running 20000 iterations) were used to diagnose the convergence of OTvMFMMs... |
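For readers unfamiliar with the model family assessed above: a von Mises-Fisher (vMF) mixture clusters unit-norm vectors on the hypersphere, with each component parameterized by a mean direction µ and concentration κ. The paper's own inference is collapsed Gibbs sampling (Algorithm 1), which is not reproduced here; the sketch below only illustrates the vMF log-density and a single soft-assignment (E-step-style) computation for a mixture. All function names and parameter values are illustrative, not from the paper.

```python
import numpy as np
from scipy.special import ive  # exponentially scaled modified Bessel function

def vmf_logpdf(x, mu, kappa):
    """Log-density of a von Mises-Fisher distribution on the unit sphere S^{d-1}.

    x, mu : unit-norm vectors of dimension d; kappa : concentration > 0.
    Uses ive (= iv(v, z) * exp(-z)) so the normalizer is stable for large kappa.
    """
    d = mu.shape[0]
    log_norm = ((d / 2 - 1) * np.log(kappa)
                - (d / 2) * np.log(2 * np.pi)
                - (np.log(ive(d / 2 - 1, kappa)) + kappa))
    return kappa * (x @ mu) + log_norm

def responsibilities(x, mus, kappas, weights):
    """Posterior probability that x belongs to each mixture component
    (the soft-assignment step of a vMF mixture)."""
    logp = np.array([np.log(w) + vmf_logpdf(x, m, k)
                     for m, k, w in zip(mus, kappas, weights)])
    logp -= logp.max()          # log-sum-exp trick for numerical stability
    p = np.exp(logp)
    return p / p.sum()

# Hypothetical two-component mixture on S^2 with orthogonal mean directions.
mus = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
r = responsibilities(np.array([1.0, 0.0, 0.0]), mus,
                     kappas=[10.0, 10.0], weights=[0.5, 0.5])
print(r)  # a point aligned with mu_1 is assigned almost entirely to component 1
```

With equal weights and concentrations, the log-odds between the two components reduce to κ(x·µ1 − x·µ2), so a point on µ1 is favored by a factor of e^κ; this is the same geometry that drives the cluster assignments in the paper's Gibbs sampler.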