Effective Neural Topic Modeling with Embedding Clustering Regularization
Authors: Xiaobao Wu, Xinshuai Dong, Thong Thanh Nguyen, Anh Tuan Luu
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on benchmark datasets demonstrate that ECRTM effectively addresses the topic collapsing issue and consistently surpasses state-of-the-art baselines in terms of topic quality, topic distributions of documents, and downstream classification tasks. |
| Researcher Affiliation | Academia | Xiaobao Wu (Nanyang Technological University), Xinshuai Dong (Carnegie Mellon University), Thong Nguyen (National University of Singapore), Anh Tuan Luu (Nanyang Technological University). |
| Pseudocode | Yes | Algorithm 1 Training algorithm for ECRTM. |
| Open Source Code | Yes | Our code is available at https://github.com/bobxwu/ECRTM |
| Open Datasets | Yes | Datasets We adopt the following benchmark document datasets for experiments: (i) 20 News Groups (20NG, Lang, 1995) is one of the most popular datasets for evaluating topic models, containing news articles with 20 labels; (ii) IMDB (Maas et al., 2011) consists of movie reviews with two labels (positive and negative); (iii) Yahoo Answer (Zhang et al., 2015) comprises question titles, contents, and best answers from the Yahoo website, with 10 labels such as Society, Culture, and Family & Relationships; (iv) AG News (Zhang et al., 2015) contains news titles and descriptions, divided into 4 categories like Sports and Business. |
| Dataset Splits | No | The paper uses standard benchmark datasets but does not explicitly report the percentages, counts, or methodology of the train/validation/test splits used in its own experiments. |
| Hardware Specification | No | No specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) were provided for running the experiments. |
| Software Dependencies | No | The paper mentions optimization algorithms like Adam and pre-trained embeddings like GloVe but does not provide specific software dependencies with version numbers (e.g., programming language versions, library versions like PyTorch or TensorFlow). |
| Experiment Setup | Yes | For Sinkhorn's algorithm in ECRTM, we set the maximum number of iterations as 1,000, the stopping tolerance as 0.005, and ε as 0.05, following Cuturi (2013). For our ECRTM, the prior distribution is specified with the Laplace approximation (Hennig et al., 2012) to approximate a symmetric Dirichlet prior as µ_{0,k} = 0 and Σ_{0,kk} = (K − 1)/(αK) with hyperparameter α. We set α as 1.0 following Card et al. (2018). Our encoder network is the same as Srivastava & Sutton (2017) and Wu et al. (2020a;b): an MLP with two linear layers and softplus activation functions, followed by two single layers, one each for the mean and the covariance matrix. We use Adam (Kingma & Ba, 2014) to optimize model parameters. (Minimal sketches of the Sinkhorn iterations and of this encoder/prior setup are given after the table.) |
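
To make the reported Sinkhorn configuration concrete, here is a minimal PyTorch sketch of entropic optimal transport (Cuturi, 2013) with the stated hyperparameters: at most 1,000 iterations, stopping tolerance 0.005, and ε = 0.05. This is an illustrative sketch, not the authors' released implementation; in particular, the uniform marginals and the cost-matrix interface are assumptions not confirmed by the excerpt above.

```python
import torch

def sinkhorn(cost, epsilon=0.05, max_iters=1000, stop_tol=0.005):
    """Entropic optimal transport via Sinkhorn iterations.

    cost: (K, V) cost matrix, e.g. between K topic embeddings and
    V word embeddings. Returns the (K, V) transport plan.
    Hyperparameters match the paper: epsilon = 0.05, at most 1,000
    iterations, stopping tolerance 0.005.
    """
    K, V = cost.shape
    # Uniform marginals over the two sides (an assumption of this sketch).
    a = torch.full((K,), 1.0 / K)
    b = torch.full((V,), 1.0 / V)
    Kmat = torch.exp(-cost / epsilon)  # Gibbs kernel of the cost matrix
    u = torch.ones_like(a)
    for _ in range(max_iters):
        u_prev = u
        # One full Sinkhorn step: update v implicitly, then u.
        u = a / (Kmat @ (b / (Kmat.T @ u)))
        # Stop once the scaling vector changes by less than the tolerance.
        if torch.max(torch.abs(u - u_prev)) < stop_tol:
            break
    v = b / (Kmat.T @ u)
    # Transport plan: diag(u) @ Kmat @ diag(v).
    return u[:, None] * Kmat * v[None, :]
```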
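Likewise, the described encoder (a two-layer softplus MLP with separate heads for the mean and covariance) and the Laplace-approximated symmetric Dirichlet prior (µ_{0,k} = 0, Σ_{0,kk} = (K − 1)/(αK), α = 1.0) can be sketched as follows. The diagonal covariance parameterization and the hidden size of 200 are assumptions of this sketch; consult the released code at https://github.com/bobxwu/ECRTM for the actual implementation.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """VAE encoder in the style of Srivastava & Sutton (2017): two linear
    layers with softplus activations, then separate linear heads for the
    mean and the (diagonal, log-space) covariance of the latent variable.
    hidden=200 is a hypothetical choice, not stated in the excerpt."""

    def __init__(self, vocab_size, num_topics, hidden=200):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vocab_size, hidden), nn.Softplus(),
            nn.Linear(hidden, hidden), nn.Softplus(),
        )
        self.mu = nn.Linear(hidden, num_topics)
        self.logvar = nn.Linear(hidden, num_topics)

    def forward(self, bow):
        h = self.mlp(bow)
        return self.mu(h), self.logvar(h)

def laplace_prior(num_topics, alpha=1.0):
    """Laplace approximation of a symmetric Dirichlet(alpha) prior
    (Hennig et al., 2012), as quoted in the paper:
    mu_{0,k} = 0 and Sigma_{0,kk} = (K - 1) / (alpha * K)."""
    mu0 = torch.zeros(num_topics)
    var0 = torch.full((num_topics,),
                      (num_topics - 1.0) / (alpha * num_topics))
    return mu0, var0
```

Per the setup row, the model parameters would then be optimized with Adam, e.g. `torch.optim.Adam(encoder.parameters())`.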