Is Automated Topic Model Evaluation Broken? The Incoherence of Coherence
Authors: Alexander Hoyle, Pranav Goel, Andrew Hian-Cheong, Denis Peskov, Jordan Boyd-Graber, Philip Resnik
NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare automated coherence with the two most widely accepted human judgment tasks: topic rating and word intrusion. To address the standardization gap, we systematically evaluate a dominant classical model and two state-of-the-art neural models on two commonly used datasets. Automated evaluations declare a winning model when corresponding human evaluations do not, calling into question the validity of fully automatic evaluations independent of human judgments. |
| Researcher Affiliation | Academia | University of Maryland {hoyle,pgoel1,dpeskov,andrewhc,jbg,resnik}@cs.umd.edu |
| Pseudocode | No | The paper describes the algorithms and procedures but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | we develop standardized, pre-processed versions of two widely-used English-language evaluation datasets, along with a transparent end-to-end code pipeline for reproduction of results (Section 4.1); github.com/ahoho/topics |
| Open Datasets | Yes | For Wikipedia, we use Wikitext-103 (WIKI, Merity et al., 2017), and for the Times, we subsample roughly 15% of documents from LDC2008T19 (NYT, Sandhaus, 2008), making it an order of magnitude larger than WIKI. |
| Dataset Splits | Yes | Val is a small held-out set of 15% of the training corpus. |
| Hardware Specification | No | The paper mentions 'dozens of GPU hours' but does not specify the model or type of GPUs used. |
| Software Dependencies | No | The paper mentions 'spaCy (Honnibal et al., 2020)', 'Mallet (McCallum, 2002)', and 'gensim (Řehůřek and Sojka, 2010)' but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | We maintain a fixed computational budget per model... use a random set of 164 hyperparameter settings... train models for a variable number of steps (a hyperparameter)... For human evaluations, we select the models that maximize NPMI... Ranges for hyperparameters and other details are in Appendix A.3. ... For all models, we evaluate 164 hyperparameter settings... For Gibbs LDA, we train for 1000 iterations... For the neural models, we train for 1000 epochs, with early stopping... The Adam (Kingma and Ba, 2015) optimizer is used with a learning rate of 10⁻³ for D-VAE and 10⁻⁴ for ETM, and a batch size of 200. |
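
The Dataset Splits row quotes a validation set of 15% held out from the training corpus. A minimal sketch of that split, assuming the pre-processed documents are held in a Python list and using scikit-learn (not necessarily the authors' own tooling):

```python
# Sketch only: hold out 15% of the training corpus as a validation set,
# as described in the paper. The function name and seed are illustrative.
from sklearn.model_selection import train_test_split

def make_splits(documents, seed=42):
    """Split pre-processed documents into train (85%) and val (15%)."""
    train_docs, val_docs = train_test_split(
        documents, test_size=0.15, random_state=seed
    )
    return train_docs, val_docs
```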
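
The Experiment Setup row reports random hyperparameter search (164 settings per model), Adam with a learning rate of 10⁻³ for D-VAE and 10⁻⁴ for ETM, a batch size of 200, and up to 1000 epochs with early stopping for the neural models. The sketch below illustrates that setup under stated assumptions; it is not the authors' code (their full pipeline is at github.com/ahoho/topics), and the model interface, hyperparameter ranges, and the NPMI scorer passed in as `score_fn` are placeholders for illustration.

```python
# Illustrative training-loop sketch of the reported setup (assumes PyTorch).
import random
import torch

N_HYPERPARAM_SETTINGS = 164    # number of sampled settings per model (from the paper)
BATCH_SIZE = 200               # reported batch size for the neural models
MAX_EPOCHS = 1000              # neural models trained up to 1000 epochs with early stopping
LEARNING_RATES = {"dvae": 1e-3, "etm": 1e-4}  # reported Adam learning rates


def sample_hyperparameters():
    """Draw one random hyperparameter setting (hypothetical ranges;
    the actual search space is in the paper's Appendix A.3)."""
    return {
        "num_topics": random.choice([25, 50, 100]),
        "dropout": random.uniform(0.0, 0.5),
    }


def train_one_setting(model, train_loader, val_docs, model_name, score_fn, patience=10):
    """Train a neural topic model with Adam and early stopping on a
    validation criterion (shown here as NPMI via `score_fn`)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATES[model_name])
    best_score, stale_epochs = float("-inf"), 0
    for epoch in range(MAX_EPOCHS):
        model.train()
        for batch in train_loader:          # batches of size BATCH_SIZE
            optimizer.zero_grad()
            loss = model(batch)             # assumed: forward pass returns the training loss
            loss.backward()
            optimizer.step()
        score = score_fn(model, val_docs)   # e.g., NPMI of the model's top words on val docs
        if score > best_score:
            best_score, stale_epochs = score, 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:    # early stopping
                break
    return best_score
```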