Improving Topic Model Stability for Effective Document Exploration

Authors: Yi Yang, Shimei Pan, Yangqiu Song, Jie Lu, Mercan Topkara

IJCAI 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this study, we investigate the stability problem in topic modeling. We first report on the experiments conducted to quantify the severity of the problem. We then propose a new learning framework to mitigate the problem by explicitly incorporating topic stability constraints in model training. We also perform a user study to demonstrate the advantages of the proposed method.
Researcher Affiliation | Collaboration | Yi Yang (University of Illinois at Urbana-Champaign), Shimei Pan (University of Maryland Baltimore County), Yangqiu Song (West Virginia University), Jie Lu (IBM T. J. Watson Research Center), Mercan Topkara
Pseudocode | Yes | Algorithm 1: Non-disruptive Topic Model Update
Open Source Code | No | The paper mentions using 'Mallet [McCallum, 2002]' and provides a URL for it, but this refers to a third-party tool rather than the authors' own open-source code for their method.
Open Datasets | Yes | To quantitatively measure the stability of a topic model, we experiment with two standard datasets, 20 Newsgroups [1] and NIPS [2]. We preprocess the datasets and train LDA models using Mallet [McCallum, 2002]. [1] http://qwone.com/~jason/20Newsgroups/ [2] https://archive.ics.uci.edu/ml/datasets/Bag+of+Words (a corpus-loading sketch follows the table)
Dataset Splits | No | The paper states 'We simulate the update process by splitting the dataset into two halves based on the documents' timestamps' and 'We split the articles into two halves based on their timestamps. The first half is used to train an initial topic model using LDA. Then we updated the topic model by adding the second half.' This describes the data split used for the update experiment but does not specify a validation set or full train/validation/test splits for model reproduction (a split sketch follows the table).
Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments.
Software Dependencies | No | The paper mentions 'train LDA models using Mallet [McCallum, 2002]' but does not give a version number for Mallet or any other software dependency.
Experiment Setup | Yes | We vary the number of new documents in D2 from zero (∅) to half of D1 (|D2| = |D1|/2) and finally to the same size as D1 (|D2| = |D1|). Each topic is represented by a set of N keywords (N is 10 in our experiments). For Gibbs sampling, there is frequently no specific criterion to test the convergence of the model, so in practice we often use a pre-determined iteration number (e.g., 1000). S is empirically set based on the dataset size, and in this work we set it to the size of D1 (a setup sketch follows the table).
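
For reference, the 20 Newsgroups corpus cited at URL [1] above is also packaged by scikit-learn. The snippet below is a minimal loading sketch, assuming a Python environment with scikit-learn installed; the paper itself preprocesses and trains with Mallet, so this is only a convenient way to obtain the same raw corpus.

```python
# Minimal sketch: fetch the 20 Newsgroups corpus via scikit-learn.
# The paper uses Mallet for preprocessing and LDA training; this only
# retrieves the same raw corpus for replication attempts.
from sklearn.datasets import fetch_20newsgroups

# Strip headers/footers/quotes so topics reflect body text, not metadata.
newsgroups = fetch_20newsgroups(
    subset="all",
    remove=("headers", "footers", "quotes"),
)
print(len(newsgroups.data))          # 18846 documents in total
print(newsgroups.target_names[:3])   # first few of the 20 group labels
```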
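The timestamp-based split quoted in the Dataset Splits row can be made concrete with a short sketch. This is a minimal illustration, assuming each document carries a `timestamp` field; the field name and data layout are assumptions for illustration, not details from the paper.

```python
# Sketch of the update simulation: order documents by timestamp, train the
# initial model on the older half (D1), then add the newer half (D2).
from typing import Dict, List, Tuple

def split_by_timestamp(docs: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Return (D1, D2): the older and newer halves of the collection."""
    ordered = sorted(docs, key=lambda d: d["timestamp"])
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]

docs = [
    {"timestamp": 1, "text": "oldest article"},
    {"timestamp": 2, "text": "older article"},
    {"timestamp": 3, "text": "newer article"},
    {"timestamp": 4, "text": "newest article"},
]
d1, d2 = split_by_timestamp(docs)
# d1 trains the initial LDA model; d2 is folded in during the model update.
```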
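The Experiment Setup row can likewise be sketched: the update batch |D2| grows from ∅ to |D1|/2 to |D1|, and each topic is compared before and after the update through its top-N (N = 10) keywords. The overlap score below is an illustrative stability proxy, not the paper's exact metric, and all function names are hypothetical.

```python
# Sketch of the experiment schedule and the top-N keyword representation.
from typing import Dict, List, Set

N = 10  # each topic is summarized by its N most probable words

def update_sizes(d1_size: int) -> List[int]:
    """The |D2| schedule from the paper: 0, |D1|/2, |D1|."""
    return [0, d1_size // 2, d1_size]

def top_keywords(word_probs: Dict[str, float], n: int = N) -> Set[str]:
    """The n highest-probability words of a single topic."""
    return set(sorted(word_probs, key=word_probs.get, reverse=True)[:n])

def keyword_overlap(before: Set[str], after: Set[str], n: int = N) -> float:
    """Fraction of top-n keywords a topic keeps across a model update."""
    return len(before & after) / float(n)

old_topic = {"model": 0.09, "topic": 0.08, "word": 0.05, "data": 0.04}
new_topic = {"topic": 0.10, "model": 0.07, "corpus": 0.05, "data": 0.03}
print(keyword_overlap(top_keywords(old_topic, 3), top_keywords(new_topic, 3), 3))
# -> 0.666...: two of the three top keywords survive the update
```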