Putting Back the Stops: Integrating Syntax with Neural Topic Models

Authors: Mayank Nagda, Sophie Fellenz

IJCAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate on seven datasets that our proposed method effectively captures both syntactic and semantic representations of a corpus while outperforming state-of-the-art neural topic models and statistical topic models in terms of topic quality."
Researcher Affiliation | Academia | "Mayank Nagda, Sophie Fellenz, RPTU Kaiserslautern-Landau, Germany; nagda@cs.uni-kl.de, fellenz@cs.uni-kl.de"
Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper; descriptions of processes appear only in paragraph form.
Open Source Code | Yes | "Code and supplementary material available at: https://github.com/mayanknagda/integrating-syntax-with-neural-topic-models"
Open Datasets | Yes | "In our experiments, we utilize seven well-known datasets. The 20 Newsgroups (20NG) dataset features around 18K newsgroup posts across 20 classes [Lang, 1995]. The Amazon reviews (AR) dataset includes roughly 35M reviews spanning 18 years [McAuley and Leskovec, 2013]. The AG News (AGN) corpus contains over a million news articles from 2,000+ sources [Zhang et al., 2015]. The GovReport Summaries (GR) dataset, as introduced by [Huang et al., 2021], provides summaries of about 20K government reports. The IMDB reviews (IR) dataset incorporates 50K movie reviews [Maas et al., 2011]. The Rotten Tomatoes reviews (RT) dataset presents 5,331 positive and 5,331 negative processed sentences from movie reviews [Pang and Lee, 2005]. Finally, the Yelp reviews (YR) dataset is composed of reviews from the 2015 Yelp Dataset Challenge [Zhang et al., 2015]."
Dataset Splits | No | No specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology for training, validation, and test sets) was provided in the paper.
Hardware Specification | No | No specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running the experiments were provided.
Software Dependencies | No | The paper mentions using "SpaCy for tokenization [Honnibal and Montani, 2017]" but does not provide a specific version number for SpaCy or any other ancillary software library.
Experiment Setup | Yes | "Setup: In this experiment, all the selected baselines are trained for ten topics, both with and without pre-processing. SyConNTM is trained for ten semantic topics and ten syntactic topics without any pre-processing." / "Setup: We benchmark SyConNTM against baselines using topic coherence, diversity, and quality across datasets. Adhering to standard procedures, we train all models on 50 and 200 topics five times, limiting SyConNTM's syntactic topics to 10."
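As a rough illustration of the benchmarking protocol quoted above (every model trained five times at each of 50 and 200 topics, with SyConNTM's syntactic topics fixed at 10), here is a minimal Python sketch. The `topic_diversity` definition (fraction of unique words among the topics' top words) is a common convention in the topic-modeling literature and is an assumption here; the paper excerpt does not spell out its exact metric definitions.

```python
# Hypothetical sketch of the evaluation protocol described in the setup:
# each model is trained five times at each topic count (50 and 200),
# with SyConNTM's syntactic-topic count held at 10. Names are illustrative.
TOPIC_COUNTS = [50, 200]
N_RUNS = 5
SYNTACTIC_TOPICS = 10  # fixed for SyConNTM

# Enumerate all (topic count, run index) configurations per model.
runs = [(k, r) for k in TOPIC_COUNTS for r in range(N_RUNS)]

def topic_diversity(topics):
    """Fraction of unique words among all topics' top words.

    This standard definition is assumed, not taken from the paper.
    `topics` is a list of top-word lists, one per topic.
    """
    words = [w for top_words in topics for w in top_words]
    return len(set(words)) / len(words)

example_topics = [["data", "model", "topic"], ["data", "word", "text"]]
print(len(runs))                        # 10 configurations per model
print(topic_diversity(example_topics))  # 5 unique words out of 6 total
```

Higher diversity means less word overlap between topics; a value of 1.0 indicates every top word is unique to a single topic.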