Representing Mixtures of Word Embeddings with Mixtures of Topic Embeddings

Authors: Dongsheng Wang, Dandan Guo, He Zhao, Huangjie Zheng, Korawat Tanwisuth, Bo Chen, Mingyuan Zhou

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on text analysis demonstrate that the proposed method, which is amenable to mini-batch stochastic gradient descent based optimization and hence scalable to big corpora, provides competitive performance in discovering more coherent and diverse topics and extracting better document representations. We have conducted comprehensive experiments on a wide variety of datasets in comparison with advanced BPTMs and NTMs, which show that our model achieves state-of-the-art performance as well as appealing interpretability.
Researcher Affiliation | Academia | Dongsheng Wang (1), Dandan Guo (2), He Zhao (3), Huangjie Zheng (4), Korawat Tanwisuth (4), Bo Chen (1), Mingyuan Zhou (4); (1) Xidian University, (2) The Chinese University of Hong Kong, Shenzhen, (3) Monash University, (4) The University of Texas at Austin
Pseudocode | Yes | Algorithm 1: Training algorithm for our proposed model. (A runnable skeleton of one training step follows the table.)
Open Source Code | Yes | The code is available at https://github.com/BoChenGroup/WeTe.
Open Datasets | Yes | To demonstrate the robustness of our WeTe in terms of learning topics and document representations, we conduct experiments on six widely used text datasets, including regular and short documents, varying in scale. The datasets include 20 Newsgroups (20NG), DBpedia (DP) (Lehmann et al., 2015), Web Snippets (WS) (Phan et al., 2008), TagMyNews (TMN) (Vitale et al., 2012), Reuters, extracted from the Reuters-21578 dataset, and Reuters Corpus Volume 2 (RCV2) (Lewis et al., 2004), where WS, DP, and TMN consist of short documents. The statistics and detailed descriptions of the datasets are provided in Appendix C. 20NG: 20 Newsgroups consists of 18,846 articles. DP: DBpedia is a crowd-sourced dataset extracted from Wikipedia pages. WS: Web Snippets, used in Li et al. (2016) and Zhao et al. (2020), contains 12,237 web search snippets in 8 categories. TMN: TagMyNews consists of 32,597 RSS news snippets with 7 categories. Reuters: a widely used corpus extracted from the Reuters-21578 dataset. RCV2: Reuters Corpus Volume 2, used in Zhao et al. (2020), consists of 804,414 documents. (A data-loading sketch for 20NG follows the table.)
Dataset Splits | No | The paper mentions a "default training/testing division" but does not specify a validation split or how one was used in the experiments; only training and testing sets are described explicitly.
Hardware Specification | Yes | All experiments are performed on an Nvidia RTX 2080 Ti GPU and implemented with PyTorch.
Software Dependencies | No | The paper states that experiments were "implemented with PyTorch" but does not provide version numbers for PyTorch or any other software dependencies, which reproducibility requires.
Experiment Setup | Yes | We set the number of topics to K = 100. For our encoder, we employ a 3-layer fully-connected network with sizes V-256-100 (V is the vocabulary size), followed by a softplus layer. We set the trade-off hyperparameter to ϵ = 1.0 and the batch size to 200. We use the Adam optimizer (Kingma & Ba, 2015) with learning rate 0.001. (An encoder and training-step sketch follows the table.)
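
As a concrete reference for the Open Datasets row, here is a minimal data-loading sketch for 20NG. It assumes scikit-learn's bundled 20 Newsgroups loader and a plain bag-of-words vectorizer; the `max_features` cap and stop-word filtering are illustrative assumptions, not the paper's preprocessing. The resulting vocabulary size V is what the V-256-100 encoder consumes.

```python
# Minimal 20NG loading sketch (assumptions: scikit-learn loader,
# illustrative vocabulary cap and stop-word filtering).
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

# Use the dataset's default training/testing division, as the paper does.
train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

vectorizer = CountVectorizer(max_features=20000, stop_words="english")
X_train = vectorizer.fit_transform(train.data)  # (n_docs, V) bag-of-words counts
X_test = vectorizer.transform(test.data)

V = X_train.shape[1]  # vocabulary size fed to the V-256-100 encoder
```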
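
And a sketch of the reported experiment setup in PyTorch. The V-256-100 layer widths, softplus output, K = 100, batch size 200, and Adam with learning rate 0.001 come from the paper's setup quoted above; the hidden activation and the dummy scalar loss are placeholders, since the actual objective (Algorithm 1) is the paper's bidirectional conditional transport cost with trade-off ϵ = 1.0, which is not reproduced here.

```python
import torch
import torch.nn as nn

K = 100      # number of topics (as reported)
V = 20000    # vocabulary size; dataset-dependent (placeholder value)

# 3-layer V-256-100 fully-connected encoder with a softplus output,
# mapping a bag-of-words vector to non-negative topic weights.
encoder = nn.Sequential(
    nn.Linear(V, 256),
    nn.ReLU(),          # hidden activation not specified in the quote; ReLU assumed
    nn.Linear(256, K),
    nn.Softplus(),
)

optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)  # Adam, lr = 0.001

# One mini-batch step with the reported batch size of 200.
bow = torch.rand(200, V)   # placeholder bag-of-words batch
theta = encoder(bow)       # (200, K) non-negative topic weights

# The real objective is the paper's bidirectional conditional transport
# cost (trade-off epsilon = 1.0); a dummy scalar stands in here so the
# optimization skeleton runs end to end.
loss = theta.mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```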