Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

In Search of Coherence and Consensus: Measuring the Interpretability of Statistical Topics

Authors: Fred Morstatter, Huan Liu

JMLR 2017

Reproducibility Variable Result LLM Response
Research Type Experimental In this work we study measures of interpretability and propose to measure topic interpretability from two perspectives: topic coherence and topic consensus. We start with an existing measure for topic coherence, model precision. It evaluates the coherence of a topic by introducing an intruded word and measuring how well a human subject or a crowdsourcing approach can identify the intruded word: if it is easy to identify, the topic is coherent. We then investigate how we can measure coherence comprehensively by examining dimensions of topic coherence. For the second perspective of topic interpretability, we suggest topic consensus, which measures how well the results of a crowdsourcing approach match the given categories of topics. Good topics should lead to good categories and thus high topic consensus. Therefore, if there is low topic consensus in terms of categories, topics could be of low interpretability.
Researcher Affiliation Academia Fred Morstatter (EMAIL), University of Southern California, Information Sciences Institute, 4676 Admiralty Way Ste. 1001, Marina Del Rey, CA 90292; Huan Liu (EMAIL), Arizona State University, 699 S. Mill Ave, Tempe, AZ 85283
Pseudocode No The paper does not contain explicitly labeled pseudocode or algorithm blocks. It includes mathematical equations and descriptions of methods but not in a structured pseudocode format.
Open Source Code No The paper mentions using the Mallet toolkit and Amazon's Mechanical Turk platform for its experiments but does not provide access to source code for the methodology it describes.
Open Datasets Yes The first text corpus focused upon in this study consists of 4,351 abstracts of accepted research proposals to the European Research Council (http://erc.europa.eu/projects-and-results/erc-funded-projects). In the first 7 years of its existence, the European Research Council (ERC) has funded approximately 4,500 projects, 4,351 of which are used in this study. Abstracts are limited to 2,000 characters, and when a researcher submits an abstract, they are required to select one of the three scientific domains their research fits into: Life Sciences (LS), Physical Sciences (PE), or Social Sciences and Humanities (SH). These labels will be used in the crowdsourced measure we propose later. The second text corpus used in this work consists of 258,919 news articles indexed by Yahoo News (https://www.yahoo.com/news/). Yahoo News maintains a corpus of every news article published on its site between 2015-02-01 and 2015-06-03 (https://webscope.sandbox.yahoo.com/catalog.php?datatype=r&did=75).
Dataset Splits No The paper describes generating topics from datasets (e.g., "We run LDA on each dataset four times, with K = 10, 25, 50, 100") and mentions "held-out documents" in the context of perplexity evaluation for predictive models. However, it does not specify explicit training, validation, or test splits for the primary experiments concerning topic coherence and consensus on the ERC and Yahoo News datasets.
Hardware Specification No The paper does not provide specific hardware details (e.g., CPU, GPU models, memory) used for running its experiments.
Software Dependencies No In this work we use the Mallet toolkit (McCallum, 2002). Before running LDA, we stripped the case of all words and removed stopwords according to the MySQL stopwords list (http://www.ranks.nl/stopwords). All LDA runs were carried out using the Mallet toolkit (McCallum, 2002). The paper mentions the Mallet toolkit and the MySQL stopwords list but does not specify version numbers for these software components.
Experiment Setup Yes In all experiments, we fix the hyperparameter values to α = 5.0 and β = 0.01.
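The model precision measure quoted in the Research Type row (spot the intruded word) can be sketched as follows. This is a minimal illustration, not the paper's exact crowdsourcing protocol: `build_intrusion_task` and `model_precision` are hypothetical helper names, and annotator guesses are assumed to arrive as a simple list of words.

```python
import random

def build_intrusion_task(topic_words, intruder_pool, n_top=5, seed=0):
    # Show the topic's top words plus one "intruded" word drawn from
    # another topic's top words; annotators are asked to spot the intruder.
    rng = random.Random(seed)
    intruder = next(w for w in intruder_pool if w not in topic_words)
    shown = list(topic_words[:n_top]) + [intruder]
    rng.shuffle(shown)
    return shown, intruder

def model_precision(intruder, guesses):
    # Fraction of annotator guesses that correctly identify the intruder:
    # easy-to-spot intruders indicate a coherent topic.
    return sum(g == intruder for g in guesses) / len(guesses)
```

For example, a genetics topic shown with the intruder "stock" that three of four annotators identify would score a model precision of 0.75 for that topic.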