Is Automated Topic Model Evaluation Broken? The Incoherence of Coherence

Authors: Alexander Hoyle, Pranav Goel, Andrew Hian-Cheong, Denis Peskov, Jordan Boyd-Graber, Philip Resnik

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We compare automated coherence with the two most widely accepted human judgment tasks: topic rating and word intrusion. To address the standardization gap, we systematically evaluate a dominant classical model and two state-of-the-art neural models on two commonly used datasets. Automated evaluations declare a winning model when corresponding human evaluations do not, calling into question the validity of fully automatic evaluations independent of human judgments. [See the coherence sketch below the table.]
Researcher Affiliation | Academia | University of Maryland; {hoyle,pgoel1,dpeskov,andrewhc,jbg,resnik}@cs.umd.edu
Pseudocode | No | The paper describes the algorithms and procedures but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | we develop standardized, pre-processed versions of two widely-used English-language evaluation datasets, along with a transparent end-to-end code pipeline for reproduction of results (Section 4.1); github.com/ahoho/topics
Open Datasets | Yes | For Wikipedia, we use Wikitext-103 (WIKI, Merity et al., 2017), and for the Times, we subsample roughly 15% of documents from LDC2008T19 (NYT, Sandhaus, 2008), making it an order of magnitude larger than WIKI. [See the data-loading sketch below the table.]
Dataset Splits | Yes | Val is a small held-out set of 15% of the training corpus.
Hardware Specification | No | The paper mentions 'dozens of GPU hours' but does not specify the model or type of GPUs used.
Software Dependencies | No | The paper mentions 'spaCy (Honnibal et al., 2020)', 'Mallet (McCallum, 2002)', and 'gensim (Řehůřek and Sojka, 2010)' but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | We maintain a fixed computational budget per model... use a random set of 164 hyperparameter settings... train models for a variable number of steps (a hyperparameter)... For human evaluations, we select the models that maximize NPMI... Ranges for hyperparameters and other details are in Appendix A.3. ... For all models, we evaluate 164 hyperparameter settings... For Gibbs LDA, we train for 1000 iterations... For the neural models, we train for 1000 epochs, with early stopping... The Adam (Kingma and Ba, 2015) optimizer is used with a learning rate of 10^-3 for D-VAE and 10^-4 for ETM, and a batch size of 200. [See the training sketch below the table.]
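
Coherence sketch. The 'Research Type' row quotes the paper's comparison of automated coherence against human judgments of topic quality. As a rough illustration of the automated side of that comparison only, the sketch below computes NPMI coherence with gensim's CoherenceModel; the toy corpus, topic word lists, and topn value are placeholders, and this is not the authors' released pipeline (github.com/ahoho/topics).

```python
# Minimal sketch of NPMI topic coherence with gensim (illustrative only;
# the corpus and topics are toy placeholders, not the paper's WIKI/NYT data).
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

# Toy tokenized reference corpus.
texts = [
    ["topic", "model", "word", "distribution"],
    ["neural", "topic", "model", "training"],
    ["human", "rating", "word", "intrusion"],
]
dictionary = Dictionary(texts)

# Top words per topic, e.g. taken from a trained LDA / D-VAE / ETM model.
topics = [
    ["topic", "model", "word"],
    ["human", "rating", "intrusion"],
]

cm = CoherenceModel(
    topics=topics,
    texts=texts,
    dictionary=dictionary,
    coherence="c_npmi",  # NPMI-based coherence measure
    topn=3,
)
print(cm.get_coherence())            # mean NPMI across topics
print(cm.get_coherence_per_topic())  # per-topic scores
```

The paper's central finding is that ranking models by a score like this can disagree with rankings produced by human rating and word-intrusion tasks.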
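
Data access sketch. Of the two corpora in the 'Open Datasets' row, Wikitext-103 (WIKI) is publicly available, while the NYT corpus (LDC2008T19) requires an LDC license and is not shown. The snippet below fetches Wikitext-103 via the Hugging Face datasets hub; this is one convenient route, not necessarily how the authors obtained or preprocessed the corpus.

```python
# One way to obtain Wikitext-103 (the WIKI corpus); the authors' own
# preprocessing pipeline is released at github.com/ahoho/topics.
from datasets import load_dataset

wiki = load_dataset("wikitext", "wikitext-103-raw-v1")
print(wiki)                       # train / validation / test splits of raw text
print(wiki["train"][10]["text"])  # one raw line from the training split
```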
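
Training sketch. The 'Experiment Setup' row quotes the neural models' optimization settings: Adam with a learning rate of 10^-3 (D-VAE) or 10^-4 (ETM), batch size 200, up to 1000 epochs with early stopping, and (per the 'Dataset Splits' row) a 15% held-out validation set. The PyTorch loop below wires those numbers together as a minimal sketch; the model interface (a module whose forward pass returns a scalar loss), the patience value, and the split logic are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch wiring together the quoted settings: Adam, batch size 200,
# up to 1000 epochs, early stopping on a 15% held-out validation set.
# Assumes `model` is a torch.nn.Module whose forward pass returns a scalar loss.
import torch
from torch.utils.data import DataLoader, TensorDataset

def train(model, bow_matrix, lr=1e-3, batch_size=200,
          max_epochs=1000, patience=10, val_fraction=0.15):
    """bow_matrix: FloatTensor of bag-of-words document vectors."""
    n_val = int(val_fraction * len(bow_matrix))   # 15% held-out validation set
    val_x, train_x = bow_matrix[:n_val], bow_matrix[n_val:]
    loader = DataLoader(TensorDataset(train_x), batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # lr=1e-4 for ETM

    best_val, stale = float("inf"), 0
    for epoch in range(max_epochs):
        model.train()
        for (batch,) in loader:
            optimizer.zero_grad()
            loss = model(batch)       # assumed: forward pass returns training loss
            loss.backward()
            optimizer.step()

        model.eval()
        with torch.no_grad():
            val_loss = model(val_x).item()
        if val_loss < best_val:       # track best validation loss
            best_val, stale = val_loss, 0
        else:
            stale += 1
            if stale >= patience:     # early stopping
                break
    return model
```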