Putting Back the Stops: Integrating Syntax with Neural Topic Models
Authors: Mayank Nagda, Sophie Fellenz
IJCAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate on seven datasets that our proposed method effectively captures both syntactic and semantic representations of a corpus while outperforming state-of-the-art neural topic models and statistical topic models in terms of topic quality. |
| Researcher Affiliation | Academia | Mayank Nagda, Sophie Fellenz, RPTU Kaiserslautern-Landau, Germany; nagda@cs.uni-kl.de, fellenz@cs.uni-kl.de |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper. Descriptions of processes are in paragraph form. |
| Open Source Code | Yes | Code and supplementary material available at: https://github.com/mayanknagda/integrating-syntax-with-neural-topic-models |
| Open Datasets | Yes | In our experiments, we utilize seven well-known datasets. The 20 Newsgroups (20NG) dataset features around 18K newsgroup posts across 20 classes [Lang, 1995]. The Amazon reviews (AR) dataset includes roughly 35M reviews spanning 18 years [McAuley and Leskovec, 2013]. The AG News (AGN) corpus contains over a million news articles from 2,000+ sources [Zhang et al., 2015]. The Gov Report Summaries (GR) dataset, as introduced by [Huang et al., 2021], provides summaries of about 20K government reports. The IMDB reviews (IR) dataset incorporates 50K movie reviews [Maas et al., 2011]. The Rotten Tomatoes reviews (RT) dataset presents 5,331 positive and 5,331 negative processed sentences from movie reviews [Pang and Lee, 2005]. Finally, the Yelp reviews (YR) dataset is composed of reviews from the 2015 Yelp Dataset Challenge [Zhang et al., 2015]. |
| Dataset Splits | No | No specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology for training, validation, and test sets) was provided in the paper. |
| Hardware Specification | No | No specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running experiments were provided. |
| Software Dependencies | No | The paper mentions using "SpaCy for tokenization [Honnibal and Montani, 2017]" but does not provide a specific version number for SpaCy or any other ancillary software library. |
| Experiment Setup | Yes | Setup: In this experiment, all the selected baselines are trained for ten topics both with and without pre-processing. SyConNTM is trained for ten semantic topics and ten syntactic topics without any pre-processing. Setup: We benchmark SyConNTM against baselines using topic coherence, diversity, and quality across datasets. Adhering to standard procedures, we train all models on 50 and 200 topics five times, limiting SyConNTM's syntactic topics to 10. |
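The evaluation quoted above relies on topic diversity, one of the standard topic-model metrics. A minimal sketch of a common formulation (the fraction of unique words among the top-k words of all topics; the paper may use a different variant, and `topic_diversity` is an illustrative name, not the authors' code):

```python
def topic_diversity(topics, top_k=25):
    """Fraction of unique words among the top-k words of every topic.

    topics: list of topics, each a list of words ordered by weight.
    Returns 1.0 when no word repeats across topics' top-k lists.
    """
    top_words = [w for topic in topics for w in topic[:top_k]]
    return len(set(top_words)) / len(top_words)


# Toy example: "season" appears in both topics, so 7 of 8 words are unique.
topics = [
    ["game", "team", "season", "player"],
    ["market", "stock", "price", "season"],
]
print(topic_diversity(topics, top_k=4))  # 0.875
```

Topic quality is then often reported as coherence multiplied by diversity, which is one reason both metrics appear together in the setup.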