Topic Modeling via Full Dependence Mixtures
Authors: Dan Fisher, Mark Kozdoba, Shie Mannor
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In particular, we evaluate the approach on three large datasets, NeurIPS papers, a Twitter corpus, and full English Wikipedia, with a large number of topics, and show that the approach performs comparably or better than the standard benchmarks. |
| Researcher Affiliation | Collaboration | Technion, Israel Institute of Technology; NVIDIA Research. |
| Pseudocode | Yes | Algorithm 1 Computation of M̂; Algorithm 2 FDM Optimization |
| Open Source Code | Yes | A reference implementation of the algorithm is available at https://github.com/fisherd3/fdm. |
| Open Datasets | Yes | We evaluate the FDM algorithm on ... three real world datasets: the NeurIPS full papers corpus, a very large (20 million tweets) Twitter dataset that was collected via the Twitter API and the full English Wikipedia. For the semi-synthetic dataset the topic quality was measured by comparison to the ground truth topics, while for the real datasets coherence and log-likelihood on a hold-out set were measured. ... (NeurIPS Papers Corpus, 2016)... The tweets were collected via the Twitter API... We use the full English Wikipedia corpus, as archived on 04/2020. |
| Dataset Splits | No | The paper specifies that '20% of the documents were taken at random as a hold-out (test) set' for the NeurIPS, Twitter, and Wikipedia datasets. While a test set is clearly defined, there is no explicit mention of a separate 'validation' set or of how hyperparameter tuning was performed using such a split (a minimal illustration of a random hold-out split is sketched below the table). |
| Hardware Specification | No | The paper states, 'Hardware specifications are given in the supplementary material.' However, these details are not provided within the main text of the paper. |
| Software Dependencies | No | The paper mentions using 'Adam, (Kingma & Ba, 2015)' as an optimizer and that Sparse LDA is 'implemented in the MALLET framework'. However, it does not provide specific version numbers for these or any other software components. |
| Experiment Setup | Yes | Input: B: Batch size Input: T: Number of topics... We use Adam, (Kingma & Ba, 2015), in the experiments... The synthetic documents were generated using the LDA model: ... symmetric Dirichlet with the standard concentration parameter α = 1/T, and 30 tokens were sampled... Sparse LDA was run with 4 threads... We used M = 1000 as the dimension of the random projection... The Sparse LDA algorithm was run in two modes: With the true hyperparameters, α = 1/T, corresponding to the true α of the corpus, and with topic sparsity parameter β = 1/N, a standard setting... also evaluated Sparse LDA with a modified hyperparameter α = 10/T, and same β... All algorithms were run 5 times, until convergence... All models were run to convergence. |
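The 'Dataset Splits' row above quotes a 20% random hold-out set but no splitting code. The snippet below is only a minimal sketch of such a split under that description; the function name, seed, and use of NumPy are assumptions, not the authors' released implementation.

```python
import numpy as np

def holdout_split(documents, test_fraction=0.2, seed=0):
    """Randomly reserve a fraction of documents as a hold-out (test) set."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(documents))
    n_test = int(round(test_fraction * len(documents)))
    test = [documents[i] for i in perm[:n_test]]
    train = [documents[i] for i in perm[n_test:]]
    return train, test
```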
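The 'Experiment Setup' row quotes the synthetic-corpus construction: documents are drawn from the LDA generative model with a symmetric Dirichlet prior of concentration α = 1/T and 30 tokens per document. A minimal sketch of that generative process, assuming a given ground-truth topic-word matrix, is shown below; the function name, seed, and array layout are illustrative and not taken from the paper's reference implementation.

```python
import numpy as np

def generate_lda_corpus(topics, n_docs, doc_len=30, alpha=None, seed=0):
    """Sample documents from the LDA generative model.

    topics: (T, V) array of topic-word distributions (each row sums to 1).
    alpha:  symmetric Dirichlet concentration; the paper uses the standard 1/T.
    """
    rng = np.random.default_rng(seed)
    T, V = topics.shape
    if alpha is None:
        alpha = 1.0 / T                                   # standard setting quoted in the paper
    corpus = []
    for _ in range(n_docs):
        theta = rng.dirichlet(np.full(T, alpha))          # per-document topic proportions
        z = rng.choice(T, size=doc_len, p=theta)          # topic assignment for each token
        words = [int(rng.choice(V, p=topics[k])) for k in z]  # word drawn from the assigned topic
        corpus.append(words)
    return corpus
```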