Improving Topic Model Stability for Effective Document Exploration
Authors: Yi Yang, Shimei Pan, Yangqiu Song, Jie Lu, Mercan Topkara
IJCAI 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this study, we investigate the stability problem in topic modeling. We first report on the experiments conducted to quantify the severity of the problem. We then propose a new learning framework to mitigate the problem by explicitly incorporating topic stability constraints in model training. We also perform a user study to demonstrate the advantages of the proposed method. |
| Researcher Affiliation | Collaboration | Yi Yang (University of Illinois at Urbana-Champaign); Shimei Pan (University of Maryland, Baltimore County); Yangqiu Song (West Virginia University); Jie Lu (IBM T. J. Watson Research Center); Mercan Topkara |
| Pseudocode | Yes | Algorithm 1: Non-disruptive Topic Model Update |
| Open Source Code | No | The paper mentions using 'Mallet [McCallum, 2002]' and provides a URL for it, but this refers to a third-party tool and not the authors' own open-source code for their methodology. |
| Open Datasets | Yes | To quantitatively measure the stability of a topic model, we experiment with two standard datasets, 20 Newsgroups [1] and NIPS [2]. We preprocess the datasets and train LDA models using Mallet [McCallum, 2002]. [1] http://qwone.com/~jason/20Newsgroups/ [2] https://archive.ics.uci.edu/ml/datasets/Bag+of+Words |
| Dataset Splits | No | The paper states 'We simulate the update process by splitting the dataset into two halves based on the documents' timestamps.' and 'We split the articles into two halves based on their timestamps. The first half is used to train an initial topic model using LDA. Then we updated the topic model by adding the second half.' This describes data splitting for the update process but does not specify a validation set or full train/validation/test splits for model reproduction. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments. |
| Software Dependencies | No | The paper mentions 'train LDA models using Mallet [McCallum, 2002]' but does not provide a specific version number for Mallet or any other software dependencies. |
| Experiment Setup | Yes | We vary the number of new documents in D2 from zero (∅) to half of D1 (\|D2\| = \|D1\|/2) and finally to the same as D1 (\|D2\| = \|D1\|). Each topic is represented by a set of N keywords (N is 10 in our experiments). For Gibbs sampling, frequently there is no specific criterion to test the convergence of the model; thus in practice, we often use a pre-determined iteration number (e.g., 1000). S is empirically set based on the dataset size, and in this work, we set it to be the size of D1. |
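The timestamp-based split quoted in the Dataset Splits row can be sketched in a few lines. This is a minimal illustration, not the authors' code; the dictionary-based document representation and the `timestamp` field name are assumptions.

```python
def split_by_timestamp(docs):
    """Split a corpus into halves D1 (older) and D2 (newer) by timestamp,
    mirroring the paper's simulated topic-model update."""
    ordered = sorted(docs, key=lambda d: d["timestamp"])
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]

# Toy corpus of four timestamped documents.
corpus = [{"timestamp": t, "text": f"doc {t}"} for t in (3, 1, 4, 2)]
d1, d2 = split_by_timestamp(corpus)

# The paper varies the update size from 0 to |D1|/2 to |D1|.
for size in (0, len(d1) // 2, len(d1)):
    update_batch = d2[:size]  # documents added in the simulated update
```

D1 trains the initial LDA model; the varying prefix of D2 plays the role of the newly arriving documents in each update scenario.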
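The Experiment Setup row notes that each topic is represented by its set of N keywords (N = 10). A natural way to compare a model before and after an update is then to match topics by keyword-set overlap. The sketch below uses Jaccard similarity and greedy matching; both are illustrative assumptions, not the paper's exact stability metric.

```python
def topic_similarity(kw_a, kw_b):
    """Jaccard similarity between two topics' top-N keyword sets."""
    a, b = set(kw_a), set(kw_b)
    return len(a & b) / len(a | b)

def match_topics(model_a, model_b):
    """Greedily pair each topic in model_a with its most similar
    unmatched topic in model_b; return the average similarity."""
    unmatched = list(range(len(model_b)))
    sims = []
    for ka in model_a:
        best = max(unmatched, key=lambda j: topic_similarity(ka, model_b[j]))
        sims.append(topic_similarity(ka, model_b[best]))
        unmatched.remove(best)
    return sum(sims) / len(sims)

# Toy models with 3 keywords per topic (the paper uses N = 10).
old_model = [["space", "nasa", "orbit"], ["hockey", "team", "game"]]
new_model = [["team", "game", "season"], ["space", "orbit", "launch"]]
print(match_topics(old_model, new_model))  # 0.5 for this toy pair
```

A score near 1 would indicate that the update left the topics largely intact; a low score signals the disruptive re-shuffling the paper sets out to prevent.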