Ordering-Sensitive and Semantic-Aware Topic Modeling

Authors: Min Yang, Tianyi Cui, Wenting Tu

AAAI 2015

Reproducibility assessment (each item lists the variable, the assessed result, and the LLM's supporting response):
Research Type: Experimental. "Extensive experiments show that our model can learn better topics and more accurate word distributions for each topic. Quantitatively, compared to state-of-the-art topic modeling approaches, GMNTM obtains significantly better performance in terms of perplexity, retrieval accuracy and classification accuracy. In this section, we evaluate our model on the 20 Newsgroups and the Reuters Corpus Volume 1 (RCV1-v2) data sets. Following the evaluation in (Srivastava, Salakhutdinov, and Hinton 2013), we compare our GMNTM model with the state-of-the-art topic models in perplexity, retrieval quality and classification accuracy."
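
For context, the perplexity reported above is derived from held-out log-likelihood. A minimal Python sketch of the standard computation (the function name and the example numbers are illustrative, not taken from the paper):

    import math

    def perplexity(total_log_likelihood, num_words):
        # perplexity = exp(-(1/N) * sum of log p(w)) over N held-out words;
        # lower is better.
        return math.exp(-total_log_likelihood / num_words)

    # e.g., an average log-probability of -7.0 nats per word gives
    # perplexity of exp(7.0), roughly 1096.6.
    print(perplexity(-7.0 * 10000, 10000))
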
Researcher Affiliation: Academia. Min Yang (The University of Hong Kong, myang@cs.hku.hk); Tianyi Cui (Zhejiang University, tianyicui@gmail.com); Wenting Tu (The University of Hong Kong, wttu@cs.hku.hk).
Pseudocode: Yes. "The algorithm in this section is summarized in Algorithm 1." (Algorithm 1: Inference Algorithm)
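
The paper's Algorithm 1 itself is not reproduced in this report. Purely as an illustration of the alternating style of inference a Gaussian-mixture topic model suggests, the sketch below interleaves fitting a mixture over word vectors with a placeholder for re-training those vectors; the loop structure and all names here are assumptions, not the authors' algorithm.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def alternating_inference(word_vectors, num_topics=128, rounds=5):
        # Hypothetical loop: (a) fit topic Gaussians to the current vectors,
        # (b) read off topic assignments. In the actual model the vectors
        # would then be re-trained with SGD before the next round.
        gmm = GaussianMixture(n_components=num_topics,
                              covariance_type='diag', random_state=0)
        assignments = None
        for _ in range(rounds):
            gmm.fit(word_vectors)
            assignments = gmm.predict(word_vectors)
            # (vector re-training step omitted)
        return gmm, assignments

    # Usage with random stand-in vectors (128-dimensional, as in the paper):
    vectors = np.random.RandomState(0).randn(1000, 128)
    _, labels = alternating_inference(vectors, num_topics=8, rounds=2)
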
Open Source Code: No. The paper mentions using and citing third-party toolkits (gensim, Oppenheimer, NLTK, scikit-learn) for parts of its implementation and comparisons, but it does not provide an explicit statement or link for the open-source code of the authors' own GMNTM model.
Open Datasets: Yes. "We adopt two widely used datasets, the 20 Newsgroups data and the RCV1-v2 data, in our evaluations." 20 Newsgroups is available at http://qwone.com/~jason/20Newsgroups and RCV1-v2 at http://trec.nist.gov/data/reuters/reuters.html.
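
For convenience, the 20 Newsgroups corpus can also be loaded directly through scikit-learn; this is an alternative access path, not necessarily how the authors obtained the data.

    from sklearn.datasets import fetch_20newsgroups

    # Downloads and caches the corpus with its standard chronological
    # train/test partition.
    train = fetch_20newsgroups(subset='train')
    test = fetch_20newsgroups(subset='test')
    print(len(train.data), len(test.data))
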
Dataset Splits: No. The paper specifies training and testing splits for both datasets (e.g., "the dataset is partitioned chronologically into 11,314 training documents and 7,531 testing documents" for 20 Newsgroups), but it does not mention or specify a separate validation split.
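
Anyone reproducing the work would therefore have to define a validation split themselves. A minimal sketch that carves one off the training documents; the 10% fraction and the seed are arbitrary illustrative choices, not from the paper:

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.model_selection import train_test_split

    train = fetch_20newsgroups(subset='train')
    # Hold out 10% of the training documents for validation.
    train_docs, val_docs = train_test_split(train.data, test_size=0.1,
                                            random_state=0)
    print(len(train_docs), len(val_docs))
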
Hardware Specification: No. The paper does not provide any specific details about the hardware (e.g., CPU or GPU models, memory) used to run the experiments.
Software Dependencies: No. The paper mentions several software components: "the online variational inference implementation of the gensim toolkit", "the publicly available code for HMM", "the NLTK toolkit (Bird 2006)", and "the variational inference algorithm in scikit-learn toolkit (Pedregosa et al. 2011)". However, it does not provide version numbers for any of these toolkits or libraries.
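
A reproduction should at least record the versions it runs against; each of the named packages exposes a __version__ attribute:

    import gensim
    import nltk
    import sklearn

    # Log toolkit versions alongside experimental results.
    print('gensim      ', gensim.__version__)
    print('nltk        ', nltk.__version__)
    print('scikit-learn', sklearn.__version__)
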
Experiment Setup: Yes. "In our GMNTM model, the learning rate is set to 0.025 and gradually reduced to 0.0001. For each word, at most m = 6 previous words in the same sentence are used as the context. For easy comparison with other models, the word vector size is set to the same as the number of topics, V = T = 128."
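
These hyperparameters translate directly into configuration. In the sketch below, the linear decay shape is an assumption; the paper gives only the starting and final learning rates:

    # Hyperparameters reported in the paper.
    M_CONTEXT = 6        # at most 6 previous words in the same sentence
    NUM_TOPICS = 128     # T
    VECTOR_SIZE = 128    # V = T, for easy comparison with other models

    def learning_rate(step, total_steps, lr_start=0.025, lr_end=0.0001):
        # Linear decay from lr_start with a floor at lr_end (decay shape
        # assumed; only the endpoints come from the paper).
        return max(lr_end, lr_start * (1.0 - step / total_steps))

    print(learning_rate(0, 10000), learning_rate(10000, 10000))  # 0.025 0.0001
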