Diagnosing and Improving Topic Models by Analyzing Posterior Variability

Authors: Linzi Xing, Michael J. Paul

Venue: AAAI 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimenting with latent Dirichlet allocation on two datasets, we propose ideas incorporating information about the posterior distributions at the topic level and at the word level. (See the posterior-variability sketch after this table.)
Researcher Affiliation | Academia | Linzi Xing, Department of Computer Science, University of Colorado, Boulder, CO 80309, linzi.xing@colorado.edu; Michael J. Paul, Department of Information Science, University of Colorado, Boulder, CO 80309, mpaul@colorado.edu
Pseudocode | No | The paper describes methods textually but does not include any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not state that its code is open source and provides no link to a code repository for the described methodology.
Open Datasets | Yes | We experiment with two datasets. The News corpus contains 2,243 articles from the Associated Press. The Wiki corpus contains 10,000 articles from Wikipedia.
Dataset Splits | No | The paper describes running Gibbs samplers with burn-in periods and sample collection, but it does not specify explicit training, validation, or test splits for model training and evaluation.
Hardware Specification | No | The paper does not provide any specific hardware details such as GPU or CPU models, memory specifications, or cloud computing resources used for running the experiments.
Software Dependencies | No | The paper mentions methods and platforms used (e.g., LDA, Gibbs sampling, Amazon Mechanical Turk) but does not list specific software libraries or their version numbers required for reproduction.
Experiment Setup | Yes | We set the number of topics to 50 for News and 100 for Wiki. We ran the Gibbs samplers for a burn-in period of 1,000 iterations, during which we also optimized the hyperparameters of the Dirichlet priors, before freezing the hyperparameters and collecting 100 samples, each separated by a 10-sample lag, running for a total of 2,000 iterations. (See the sampling-schedule sketch below.)
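
For concreteness, the sampling schedule quoted in the Experiment Setup row can be written out as a short sketch. This is a minimal illustration, not the authors' released code: the `sampler` object and its `sweep`, `optimize_hyperparameters`, and `phi` methods are hypothetical stand-ins, since the paper does not name a specific implementation. Only the schedule itself (1,000 burn-in iterations with Dirichlet hyperparameter optimization, then 100 samples at a lag of 10, for 2,000 iterations total) comes from the paper.

```python
# Sketch of the paper's Gibbs sampling schedule.
# `sampler` and its methods are hypothetical stand-ins, not a real library.

BURN_IN = 1000       # iterations with hyperparameter optimization
NUM_SAMPLES = 100    # posterior samples to collect
SAMPLE_LAG = 10      # iterations between collected samples
NUM_TOPICS = {"news": 50, "wiki": 100}  # per-corpus topic counts from the paper

def run_schedule(sampler):
    """Run burn-in, then collect lagged samples of the topic-word matrix."""
    samples = []
    for it in range(BURN_IN + NUM_SAMPLES * SAMPLE_LAG):  # 2,000 iterations total
        sampler.sweep()  # one full Gibbs sweep over all tokens
        if it < BURN_IN:
            # Optimize the Dirichlet priors only during burn-in,
            # then freeze them for the sampling phase.
            sampler.optimize_hyperparameters()
        elif (it - BURN_IN + 1) % SAMPLE_LAG == 0:
            samples.append(sampler.phi())  # K x V topic-word probabilities
    return samples  # 100 matrices, one per collected sample
```

With K = 50 topics for News and K = 100 for Wiki, `run_schedule` would return the 100 topic-word matrices that a variability analysis like the one sketched next operates on.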
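
The Research Type row's mention of "information about the posterior distributions at the topic level and at the word level" suggests statistics like the following sketch, which summarizes how much each topic-word probability varies across the collected Gibbs samples. The aggregation chosen here (mean standard deviation over each topic's top words) is an illustrative assumption, not necessarily the paper's exact metric.

```python
import numpy as np

def posterior_variability(samples, top_n=20):
    """Word- and topic-level variability across posterior samples.

    samples: list of K x V topic-word probability matrices, one per
    collected Gibbs sample. Returns the per-entry standard deviation
    and an illustrative per-topic summary over each topic's top words.
    """
    phi = np.stack(samples)       # shape: (S, K, V)
    word_var = phi.std(axis=0)    # (K, V): variability of each p(word | topic)
    mean_phi = phi.mean(axis=0)   # (K, V): posterior-mean topic-word distribution

    topic_var = []
    for k in range(mean_phi.shape[0]):
        top_words = np.argsort(mean_phi[k])[::-1][:top_n]
        # Illustrative topic-level score: average variability of the top words.
        topic_var.append(word_var[k, top_words].mean())
    return word_var, np.asarray(topic_var)
```

High topic-level variability would then flag topics whose word distributions are unstable across samples, which is the kind of diagnostic signal the paper investigates.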