Topic Modeling via Full Dependence Mixtures

Authors: Dan Fisher, Mark Kozdoba, Shie Mannor

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "In particular, we evaluate the approach on three large datasets, NeurIPS papers, a Twitter corpus, and full English Wikipedia, with a large number of topics, and show that the approach performs comparably or better than the standard benchmarks." |
| Researcher Affiliation | Collaboration | 1. Technion, Israel Institute of Technology; 2. NVIDIA Research. |
| Pseudocode | Yes | Algorithm 1: Computation of M̂; Algorithm 2: FDM Optimization |
| Open Source Code | Yes | "A reference implementation of the algorithm is available at https://github.com/fisherd3/fdm." |
| Open Datasets | Yes | "We evaluate the FDM algorithm on ... three real world datasets: the NeurIPS full papers corpus, a very large (20 million tweets) Twitter dataset that was collected via the Twitter API and the full English Wikipedia. For the semi-synthetic dataset the topic quality was measured by comparison to the ground truth topics, while for the real datasets coherence and log-likelihood on a hold-out set was measured. ... (NeurIPS Papers Corpus, 2016) ... The tweets were collected via the Twitter API ... We use the full English Wikipedia corpus, as archived on 04/2020." |
| Dataset Splits | No | The paper specifies that "20% of the documents were taken at random as a hold-out (test) set" for the NeurIPS, Twitter, and Wikipedia datasets. While a test set is clearly defined, there is no explicit mention of a separate validation set or of how hyperparameters were tuned on such a split. |
| Hardware Specification | No | The paper states, "Hardware specifications are given in the supplementary material." However, these details are not provided within the main text of the paper. |
| Software Dependencies | No | The paper mentions using Adam (Kingma & Ba, 2015) as an optimizer and notes that SparseLDA is "implemented in the MALLET framework". However, it does not provide specific version numbers for these or any other software components. |
| Experiment Setup | Yes | "Input: B: Batch size. Input: T: Number of topics ... We use Adam (Kingma & Ba, 2015) in the experiments ... The synthetic documents were generated using the LDA model: ... symmetric Dirichlet with the standard concentration parameter α = 1/T, and 30 tokens were sampled ... SparseLDA was run with 4 threads ... We used M = 1000 as the dimension of the random projection ... The SparseLDA algorithm was run in two modes: with the true hyperparameters, α = 1/T, corresponding to the true α of the corpus, and with topic sparsity parameter β = 1/N, a standard setting ... also evaluated SparseLDA with a modified hyperparameter α = 10/T, and the same β ... All algorithms were run 5 times, until convergence ... All models were run to convergence." |
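The quoted experiment setup describes generating synthetic documents from the LDA model with a symmetric Dirichlet prior (α = 1/T) and 30 tokens per document. A minimal sketch of that generative process is below; the topic matrix, vocabulary size, corpus size, and random seed are illustrative assumptions, not values from the paper (which uses semi-synthetic topics):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: T topics, vocabulary of N words, 100 documents.
T, N, n_docs, doc_len = 10, 500, 100, 30
alpha = 1.0 / T  # symmetric Dirichlet concentration, as in the paper

# Random topic-word distributions (each row sums to 1); a stand-in
# for the paper's semi-synthetic ground-truth topics.
topics = rng.dirichlet(np.ones(N), size=T)  # shape (T, N)

def sample_document():
    # Draw this document's topic mixture from Dirichlet(alpha, ..., alpha).
    theta = rng.dirichlet(np.full(T, alpha))
    # For each of the 30 tokens: pick a topic, then a word from that topic.
    z = rng.choice(T, size=doc_len, p=theta)
    return [int(rng.choice(N, p=topics[k])) for k in z]

corpus = [sample_document() for _ in range(n_docs)]
```

With α = 1/T the topic mixtures are sparse, so most documents concentrate on a few topics, which matches the standard LDA benchmark setting quoted above.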