OTLDA: A Geometry-aware Optimal Transport Approach for Topic Modeling

Authors: Viet Huynh, He Zhao, Dinh Phung

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments illustrate that the proposed framework outperforms competitive approaches in terms of topic coherence on assorted text corpora, including both long and short documents. The learned topic representations also lead to better accuracy on downstream classification tasks, which serves as an extrinsic evaluation.
Researcher Affiliation | Academia | Viet Huynh, He Zhao, Dinh Phung; Faculty of Information Technology, Monash University, Australia; {viet.huynh, ethan.zhao, dinh.phung}@monash.edu
Pseudocode | Yes | Algorithm 1: Optimal Transport based LDA (OTLDA). (A generic Sinkhorn sketch, not the authors' Algorithm 1, is given after this table.)
Open Source Code | No | The paper mentions using default parameters from the source code of other models (ETM, LFTM, WNMF) and notes that some code (DWL) is not publicly available, but it provides no explicit statement or link for the source code of its own method (OTLDA).
Open Datasets | Yes | For regular documents, we use two popular corpora including 20Newsgroups (20NG) and Wikipedia. The 20Newsgroups corpus consists of newsgroup posts including approximately 18,000 documents. ... The larger Wikipedia corpus is downloaded from wikipedia.com, including about 1.1 million documents. ... We also use two short text corpora, namely 20NGshort and Twitter, to demonstrate the strength of our model in modeling short texts. ... We summarize the statistics of datasets in Table 1.
Dataset Splits | Yes | We then use 80% for training and 10% each for validation and testing. (A split sketch under stated assumptions follows the table.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) used for running the experiments.
Software Dependencies | No | The paper mentions software such as the Gensim package for LDA, scikit-learn for SVC, and Google's word2vec, but does not specify version numbers for these dependencies, which are necessary for reproducibility.
Experiment Setup | Yes | We ran our proposed methods with several regularization terms including 0.05, 0.1, 1, and 50 and choose the best performance among them. We found that with regular documents, a large regularizer λ, e.g., 50, provides better topic coherence, while a smaller regularizer, e.g., λ = 0.05, is more suitable for short documents. ... for LDA, we use the default parameters given by the Gensim package. ... We use the SVC model in scikit-learn with the default parameters to train and report the classification performance in Table 3. (A sketch of this protocol under stated assumptions is given after the table.)
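
The paper's Algorithm 1 is not reproduced in this summary. As a point of reference only, the sketch below shows a generic entropic-regularized optimal-transport (Sinkhorn) solver of the kind the reported regularizer λ refers to; the cost matrix, marginals, iteration count, and tolerance are illustrative assumptions, not the authors' OTLDA implementation.

```python
import numpy as np

def sinkhorn(a, b, cost, lam, n_iters=500, tol=1e-9):
    """Generic entropic-regularized OT (Sinkhorn) solver -- illustrative only.

    a    : source marginal, e.g. a document's word distribution (sums to 1)
    b    : target marginal, e.g. a topic's word distribution (sums to 1)
    cost : ground-cost matrix between vocabulary items (e.g. embedding distances)
    lam  : entropic regularization weight (the paper sweeps values such as 0.05-50)
    """
    K = np.exp(-cost / lam)                 # Gibbs kernel
    u = np.ones_like(a, dtype=float)
    for _ in range(n_iters):
        u_prev = u
        v = b / (K.T @ u)                   # scale columns to match target marginal
        u = a / (K @ v)                     # scale rows to match source marginal
        if np.max(np.abs(u - u_prev)) < tol:
            break
    plan = u[:, None] * K * v[None, :]      # entropic transport plan
    return np.sum(plan * cost)              # transport cost under that plan
```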
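
The reported 80/10/10 split gives no further detail (seed, stratification, tooling). A minimal sketch of one way to reproduce such a split with scikit-learn is shown below; the variable names and the fixed seed are assumptions.

```python
from sklearn.model_selection import train_test_split

# docs: list of raw documents; labels: matching class labels (names assumed)
train_docs, rest_docs, train_y, rest_y = train_test_split(
    docs, labels, test_size=0.2, random_state=0)       # 80% train
valid_docs, test_docs, valid_y, test_y = train_test_split(
    rest_docs, rest_y, test_size=0.5, random_state=0)  # 10% validation / 10% test
```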
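
For the experiment setup, the paper only states that λ was chosen from {0.05, 0.1, 1, 50} by topic coherence and that the downstream classifier was scikit-learn's SVC with default parameters. The sketch below outlines that protocol under assumptions: train_otlda, compute_coherence, and doc_topic_features are hypothetical helpers standing in for steps the paper does not specify.

```python
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Sweep the regularizer and keep the model with the best validation coherence.
best_lam, best_coherence, best_model = None, float("-inf"), None
for lam in (0.05, 0.1, 1, 50):
    model = train_otlda(train_docs, lam=lam)           # hypothetical training call
    coherence = compute_coherence(model, valid_docs)   # hypothetical coherence metric
    if coherence > best_coherence:
        best_lam, best_coherence, best_model = lam, coherence, model

# Extrinsic evaluation: default-parameter SVC on learned document-topic features.
clf = SVC()  # scikit-learn defaults, as stated in the paper
clf.fit(doc_topic_features(best_model, train_docs), train_y)
pred = clf.predict(doc_topic_features(best_model, test_docs))
print(f"lambda={best_lam}, test accuracy={accuracy_score(test_y, pred):.3f}")
```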