reproducibilityindex.ai

Effective Neural Topic Modeling with Embedding Clustering Regularization

Authors: Xiaobao Wu, Xinshuai Dong, Thong Thanh Nguyen, Anh Tuan Luu

ICML 2023

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on benchmark datasets demonstrate that ECRTM effectively addresses the topic collapsing issue and consistently surpasses state-of-the-art baselines in terms of topic quality, topic distributions of documents, and downstream classification tasks.
Researcher Affiliation	Academia	Xiaobao Wu (1), Xinshuai Dong (2), Thong Nguyen (3), Anh Tuan Luu (1). (1) Nanyang Technological University; (2) Carnegie Mellon University; (3) National University of Singapore.
Pseudocode	Yes	Algorithm 1: Training algorithm for ECRTM.
Open Source Code	Yes	Our code is available at https://github.com/bobxwu/ECRTM
Open Datasets	Yes	Datasets: We adopt the following benchmark document datasets for experiments: (i) 20 News Groups (20NG, Lang, 1995) is one of the most popular datasets for evaluating topic models, including news articles with 20 labels; (ii) IMDB (Maas et al., 2011) consists of movie reviews with two labels (positive and negative); (iii) Yahoo Answer (Zhang et al., 2015) consists of question titles, contents, and the best answers from the Yahoo website with 10 labels, such as Society, Culture, and Family & Relationships; (iv) AG News (Zhang et al., 2015) contains news titles and descriptions, divided into 4 categories such as Sports and Business.
Dataset Splits	No	The paper uses standard benchmark datasets, but it does not explicitly provide specific percentages, counts, or a detailed methodology for train/validation/test splits for its own experiments beyond general machine learning terminology.
Hardware Specification	No	No specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) were provided for running the experiments.
Software Dependencies	No	The paper mentions optimization algorithms like Adam and pre-trained embeddings like GloVe but does not provide specific software dependencies with version numbers (e.g., programming language versions, library versions like PyTorch or TensorFlow).
Experiment Setup	Yes	For the Sinkhorn's algorithm of ECRTM, we set the maximum number of iterations to 1,000, the stop tolerance to 0.005, and ε to 0.05 following Cuturi (2013). For our ECRTM, the prior distribution is specified with Laplace approximation (Hennig et al., 2012) to approximate a symmetric Dirichlet prior as μ_{0,k} = 0 and Σ_{0,kk} = (K - 1)/(αK) with hyperparameter α. We set α to 1.0 following Card et al. (2018). Our encoder network is the same as Srivastava & Sutton (2017); Wu et al. (2020a;b): an MLP that has two linear layers with softplus activation function, concatenated with two single layers each for the mean and covariance matrix. We use Adam (Kingma & Ba, 2014) to optimize model parameters.
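To make the reported Sinkhorn settings concrete, here is a minimal NumPy sketch of a generic entropic-regularized Sinkhorn iteration in the style of Cuturi (2013), using the quoted values (1,000 maximum iterations, stop tolerance 0.005, ε = 0.05) as defaults. It is not taken from the ECRTM repository: the function name `sinkhorn_plan`, the toy cost matrix, and the choice of measuring the stopping error as the L1 gap on the column marginals are all assumptions made for illustration.

```python
import numpy as np


def sinkhorn_plan(cost, a, b, epsilon=0.05, max_iter=1000, tol=0.005):
    """Approximate the entropic-regularized transport plan between a and b.

    cost: (n, m) cost matrix; a: (n,) source weights; b: (m,) target weights.
    Hypothetical helper for illustration, not the authors' implementation.
    """
    kernel = np.exp(-cost / epsilon)  # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(max_iter):
        v = b / (kernel.T @ u)  # rescale to match column marginals
        u = a / (kernel @ v)    # rescale to match row marginals
        # Assumed stopping rule: L1 error on the column marginals.
        col_marginal = v * (kernel.T @ u)
        if np.abs(col_marginal - b).sum() < tol:
            break
    return u[:, None] * kernel * v[None, :]


# Toy example: 4 "words" transported to 2 "topics" with uniform weights.
cost = np.array([[0.0, 1.0],
                 [0.1, 0.9],
                 [1.0, 0.0],
                 [0.9, 0.1]])
a = np.full(4, 0.25)
b = np.full(2, 0.5)
plan = sinkhorn_plan(cost, a, b)
```

With a small ε such as 0.05, the plan concentrates almost all of each row's mass on its cheapest column, so in this toy case the first two rows go to the first column and the last two rows to the second.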
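The prior parameters quoted above can be checked against the general Laplace approximation of a Dirichlet prior as used by Srivastava & Sutton (2017), where μ_k = log α_k - (1/K) Σ_i log α_i and Σ_kk = (1/α_k)(1 - 2/K) + (1/K²) Σ_i (1/α_i). The sketch below evaluates that general form and shows that, for a symmetric prior, it reduces to μ_{0,k} = 0 and Σ_{0,kk} = (K - 1)/(αK). The function name and the choice of K = 50 topics are illustrative assumptions; the page does not state the number of topics used.

```python
import math


def laplace_dirichlet_prior(alphas):
    """Return (means, variances) of the Laplace approximation to Dirichlet(alphas).

    Hypothetical helper for illustration; only the diagonal of the
    covariance matrix is computed.
    """
    k = len(alphas)
    mean_log = sum(math.log(x) for x in alphas) / k
    sum_inv = sum(1.0 / x for x in alphas)
    means = [math.log(x) - mean_log for x in alphas]
    variances = [(1.0 / x) * (1.0 - 2.0 / k) + sum_inv / (k * k) for x in alphas]
    return means, variances


# Symmetric case with alpha = 1.0 (the reported setting) and an assumed K = 50.
num_topics = 50
alpha = 1.0
means, variances = laplace_dirichlet_prior([alpha] * num_topics)
closed_form = (num_topics - 1) / (alpha * num_topics)
```

In the symmetric case every mean is zero and every variance equals the closed form, which is why the setup only needs the single hyperparameter α.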