OTLDA: A Geometry-aware Optimal Transport Approach for Topic Modeling

Authors: Viet Huynh, He Zhao, Dinh Phung

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments illustrate that the proposed framework outperforms competitive approaches in terms of topic coherence on assorted text corpora, including both long and short documents. The learned topic representations also lead to better accuracy on downstream classification tasks, which serves as an extrinsic evaluation.
Researcher Affiliation | Academia | Viet Huynh, He Zhao, Dinh Phung; Faculty of Information Technology, Monash University, Australia; {viet.huynh, ethan.zhao, dinh.phung}@monash.edu
Pseudocode | Yes | Algorithm 1: Optimal Transport based LDA (OTLDA). (A generic Sinkhorn sketch, not the authors' Algorithm 1, is given after this table.)
Open Source Code | No | The paper mentions using default parameters from the source code of other models (ETM, LFTM, WNMF) and notes that some code (DWL) is not publicly available, but it provides no explicit statement or link for the source code of its own method (OTLDA).
Open Datasets | Yes | For regular documents, we use two popular corpora including 20Newsgroups (20NG) and Wikipedia. The 20Newsgroups corpus consists of newsgroup posts including approximately 18,000 documents. ... The larger Wikipedia corpus is downloaded from wikipedia.com, including about 1.1 million documents. ... We also use two short text corpora, namely 20NGshort and Twitter, to demonstrate the strength of our model in modeling short texts. ... We summarize the statistics of datasets in Table 1.
Dataset Splits | Yes | We then use 80% for training and 10% each for validation and testing. (A split sketch under stated assumptions follows the table.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) used for running the experiments.
Software Dependencies | No | The paper mentions software such as the Gensim package for LDA, scikit-learn for SVC, and Google's word2vec, but does not specify version numbers for these dependencies, which are necessary for reproducibility.
Experiment Setup | Yes | We ran our proposed methods with several regularization terms including 0.05, 0.1, 1, and 50 and choose the best performance among them. We found that with regular documents, a large regularizer λ, e.g., 50, provides better topic coherence, while a smaller regularizer, e.g., λ = 0.05, is more suitable for short documents. ... for LDA, we use the default parameters given by the Gensim package. ... We use the SVC model in scikit-learn with the default parameters to train and report the classification performance in Table 3. (A sketch of this protocol under stated assumptions is given after the table.)
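
The paper's Algorithm 1 is not reproduced in this summary. As a point of reference only, the sketch below shows a generic entropic-regularized optimal-transport (Sinkhorn) solver of the kind the reported regularizer λ refers to; the cost matrix, marginals, iteration count, and tolerance are illustrative assumptions, not the authors' OTLDA implementation.

```python
import numpy as np

def sinkhorn(a, b, cost, lam, n_iters=500, tol=1e-9):
    """Generic entropic-regularized OT (Sinkhorn) solver -- illustrative only.

    a    : source marginal, e.g. a document's word distribution (sums to 1)
    b    : target marginal, e.g. a topic's word distribution (sums to 1)
    cost : ground-cost matrix between vocabulary items (e.g. embedding distances)
    lam  : entropic regularization weight (the paper sweeps values such as 0.05-50)
    """
    K = np.exp(-cost / lam)                 # Gibbs kernel
    u = np.ones_like(a, dtype=float)
    for _ in range(n_iters):
        u_prev = u
        v = b / (K.T @ u)                   # scale columns to match target marginal
        u = a / (K @ v)                     # scale rows to match source marginal
        if np.max(np.abs(u - u_prev)) < tol:
            break
    plan = u[:, None] * K * v[None, :]      # entropic transport plan
    return np.sum(plan * cost)              # transport cost under that plan
```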
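
The reported 80/10/10 split gives no further detail (seed, stratification, tooling). A minimal sketch of one way to reproduce such a split with scikit-learn is shown below; the variable names and the fixed seed are assumptions.

```python
from sklearn.model_selection import train_test_split

# docs: list of raw documents; labels: matching class labels (names assumed)
train_docs, rest_docs, train_y, rest_y = train_test_split(
    docs, labels, test_size=0.2, random_state=0)       # 80% train
valid_docs, test_docs, valid_y, test_y = train_test_split(
    rest_docs, rest_y, test_size=0.5, random_state=0)  # 10% validation / 10% test
```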
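
For the experiment setup, the paper only states that λ was chosen from {0.05, 0.1, 1, 50} by topic coherence and that the downstream classifier was scikit-learn's SVC with default parameters. The sketch below outlines that protocol under assumptions: train_otlda, compute_coherence, and doc_topic_features are hypothetical helpers standing in for steps the paper does not specify.

```python
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Sweep the regularizer and keep the model with the best validation coherence.
best_lam, best_coherence, best_model = None, float("-inf"), None
for lam in (0.05, 0.1, 1, 50):
    model = train_otlda(train_docs, lam=lam)           # hypothetical training call
    coherence = compute_coherence(model, valid_docs)   # hypothetical coherence metric
    if coherence > best_coherence:
        best_lam, best_coherence, best_model = lam, coherence, model

# Extrinsic evaluation: default-parameter SVC on learned document-topic features.
clf = SVC()  # scikit-learn defaults, as stated in the paper
clf.fit(doc_topic_features(best_model, train_docs), train_y)
pred = clf.predict(doc_topic_features(best_model, test_docs))
print(f"lambda={best_lam}, test accuracy={accuracy_score(test_y, pred):.3f}")
```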