Ordering-Sensitive and Semantic-Aware Topic Modeling

Authors: Min Yang, Tianyi Cui, Wenting Tu

AAAI 2015

Reproducibility assessment (each item lists the variable, the assessed result, and the LLM's supporting response):
Research Type: Experimental. "Extensive experiments show that our model can learn better topics and more accurate word distributions for each topic. Quantitatively, compared to state-of-the-art topic modeling approaches, GMNTM obtains significantly better performance in terms of perplexity, retrieval accuracy and classification accuracy. In this section, we evaluate our model on the 20 Newsgroups and the Reuters Corpus Volume 1 (RCV1-v2) data sets. Following the evaluation in (Srivastava, Salakhutdinov, and Hinton 2013), we compare our GMNTM model with the state-of-the-art topic models in perplexity, retrieval quality and classification accuracy."
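
For context, the perplexity reported above is derived from held-out log-likelihood. A minimal Python sketch of the standard computation (the function name and the example numbers are illustrative, not taken from the paper):

    import math

    def perplexity(total_log_likelihood, num_words):
        # perplexity = exp(-(1/N) * sum of log p(w)) over N held-out words;
        # lower is better.
        return math.exp(-total_log_likelihood / num_words)

    # e.g., an average log-probability of -7.0 nats per word gives
    # perplexity of exp(7.0), roughly 1096.6.
    print(perplexity(-7.0 * 10000, 10000))
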
Researcher Affiliation: Academia. Min Yang (The University of Hong Kong, myang@cs.hku.hk); Tianyi Cui (Zhejiang University, tianyicui@gmail.com); Wenting Tu (The University of Hong Kong, wttu@cs.hku.hk).
Pseudocode: Yes. "The algorithm in this section is summarized in Algorithm 1." (Algorithm 1: Inference Algorithm)
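
The paper's Algorithm 1 itself is not reproduced in this report. Purely as an illustration of the alternating style of inference a Gaussian-mixture topic model suggests, the sketch below interleaves fitting a mixture over word vectors with a placeholder for re-training those vectors; the loop structure and all names here are assumptions, not the authors' algorithm.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def alternating_inference(word_vectors, num_topics=128, rounds=5):
        # Hypothetical loop: (a) fit topic Gaussians to the current vectors,
        # (b) read off topic assignments. In the actual model the vectors
        # would then be re-trained with SGD before the next round.
        gmm = GaussianMixture(n_components=num_topics,
                              covariance_type='diag', random_state=0)
        assignments = None
        for _ in range(rounds):
            gmm.fit(word_vectors)
            assignments = gmm.predict(word_vectors)
            # (vector re-training step omitted)
        return gmm, assignments

    # Usage with random stand-in vectors (128-dimensional, as in the paper):
    vectors = np.random.RandomState(0).randn(1000, 128)
    _, labels = alternating_inference(vectors, num_topics=8, rounds=2)
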
Open Source Code: No. The paper mentions using and citing third-party toolkits (gensim, Oppenheimer, NLTK, scikit-learn) for parts of its implementation and comparisons, but it does not provide an explicit statement or link for the open-source code of the authors' own GMNTM model.
Open Datasets: Yes. "We adopt two widely used datasets, the 20 Newsgroups data and the RCV1-v2 data, in our evaluations." 20 Newsgroups is available at http://qwone.com/~jason/20Newsgroups and RCV1-v2 at http://trec.nist.gov/data/reuters/reuters.html.
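
For convenience, the 20 Newsgroups corpus can also be loaded directly through scikit-learn; this is an alternative access path, not necessarily how the authors obtained the data.

    from sklearn.datasets import fetch_20newsgroups

    # Downloads and caches the corpus with its standard chronological
    # train/test partition.
    train = fetch_20newsgroups(subset='train')
    test = fetch_20newsgroups(subset='test')
    print(len(train.data), len(test.data))
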
Dataset Splits: No. The paper specifies training and testing splits for both datasets (e.g., "the dataset is partitioned chronologically into 11,314 training documents and 7,531 testing documents" for 20 Newsgroups), but it does not mention or specify a separate validation split.
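
Anyone reproducing the work would therefore have to define a validation split themselves. A minimal sketch that carves one off the training documents; the 10% fraction and the seed are arbitrary illustrative choices, not from the paper:

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.model_selection import train_test_split

    train = fetch_20newsgroups(subset='train')
    # Hold out 10% of the training documents for validation.
    train_docs, val_docs = train_test_split(train.data, test_size=0.1,
                                            random_state=0)
    print(len(train_docs), len(val_docs))
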
Hardware Specification: No. The paper does not provide any specific details about the hardware (e.g., CPU or GPU models, memory) used to run the experiments.
Software Dependencies: No. The paper mentions several software components: "the online variational inference implementation of the gensim toolkit", "the publicly available code for HMM", "the NLTK toolkit (Bird 2006)", and "the variational inference algorithm in scikit-learn toolkit (Pedregosa et al. 2011)". However, it does not provide version numbers for any of these toolkits or libraries.
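
A reproduction should at least record the versions it runs against; each of the named packages exposes a __version__ attribute:

    import gensim
    import nltk
    import sklearn

    # Log toolkit versions alongside experimental results.
    print('gensim      ', gensim.__version__)
    print('nltk        ', nltk.__version__)
    print('scikit-learn', sklearn.__version__)
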
Experiment Setup: Yes. "In our GMNTM model, the learning rate is set to 0.025 and gradually reduced to 0.0001. For each word, at most m = 6 previous words in the same sentence are used as the context. For easy comparison with other models, the word vector size is set to the same as the number of topics, V = T = 128."
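
These hyperparameters translate directly into configuration. In the sketch below, the linear decay shape is an assumption; the paper gives only the starting and final learning rates:

    # Hyperparameters reported in the paper.
    M_CONTEXT = 6        # at most 6 previous words in the same sentence
    NUM_TOPICS = 128     # T
    VECTOR_SIZE = 128    # V = T, for easy comparison with other models

    def learning_rate(step, total_steps, lr_start=0.025, lr_end=0.0001):
        # Linear decay from lr_start with a floor at lr_end (decay shape
        # assumed; only the endpoints come from the paper).
        return max(lr_end, lr_start * (1.0 - step / total_steps))

    print(learning_rate(0, 10000), learning_rate(10000, 10000))  # 0.025 0.0001
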