Topic Segmentation with an Ordering-Based Topic Model

Authors: Lan Du, John Pate, Mark Johnson

Venue: AAAI 2015

Each reproducibility variable below is listed as "Variable: Result", followed by the LLM response quoting or summarizing the paper.
Research Type: Experimental
Experimental results show that our model performs significantly better than the other ordering-based topic models on nearly all the corpora that we used, and competitively with other state-of-the-art topic segmentation models on corpora that have a strong ordering regularity. In this section we compare our TMIO model to four state-of-the-art topic segmentation models, one ordering-based and three with no sense of ordering, in two text analysis tasks (topic segmentation and cross-document alignment) with two different kinds of documents.
Researcher Affiliation: Academia
Lan Du, John K. Pate and Mark Johnson, Department of Computing, Macquarie University, Sydney, NSW 2109, Australia. {Lan.Du, John.Pate, Mark.Johnson}@mq.edu.au
Pseudocode: Yes
Given a set of topics K and a set of documents D = {1, 2, 3, ..., D}, the full generative process can be read off from the above components as the following:
1. For each topic k ∈ {1, ..., K}:
   (a) Draw word distribution φ_k ~ Dirichlet_V(β).
   (b) Draw the Bernoulli parameter in the FFM, μ_k ~ Beta(α/K, 1).
2. For each document d ∈ {1, ..., D}:
   (a) Draw a full ordering π_d from the GMM with Eq. (1).
   (b) For each topic k, draw b_{d,k} ~ Bernoulli(μ_k).
   (c) Generate the incomplete ordering π′_d from π_d and b_d, i.e., π_d restricted to the topics with b_{d,k} = 1.
   (d) For each section k ∈ π′_d:
      i. Draw the topic span l_{d,k} ~ Poisson(λ).
      ii. For each paragraph j in the span, let z_{d,j} = k and generate a set of words w_{d,j} from Discrete(φ_k).
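To make the quoted generative process concrete, here is a minimal Python sketch of it. This is not the authors' implementation (no TMIO code was released): the repeated-insertion sampler standing in for the paper's GMM, the dispersion parameter rho, the shifted Poisson span (so every section is non-empty), the fixed paragraph length, and all function names are assumptions made for illustration.

```python
import numpy as np

def sample_gmm_ordering(K, rho, rng):
    # Repeated-insertion sampler for a Mallows-style ordering model:
    # item j is inserted v_j places from the end of the current prefix,
    # with P(v_j = v) proportional to exp(-rho * v), v in {0, ..., j}.
    # rho > 0 concentrates orderings around the canonical order 0..K-1.
    ordering = [0]
    for j in range(1, K):
        weights = np.exp(-rho * np.arange(j + 1))
        v = rng.choice(j + 1, p=weights / weights.sum())
        ordering.insert(len(ordering) - v, j)
    return ordering

def generate_corpus(D, K, V, alpha, beta, lam, rho, words_per_par=20, seed=0):
    rng = np.random.default_rng(seed)
    phi = rng.dirichlet(np.full(V, beta), size=K)   # step 1a: topic-word dists
    mu = rng.beta(alpha / K, 1.0, size=K)           # step 1b: FFM usage probs
    corpus = []
    for _ in range(D):
        pi = sample_gmm_ordering(K, rho, rng)       # step 2a: full ordering
        b = rng.random(K) < mu                      # step 2b: topic selection
        pi_prime = [k for k in pi if b[k]]          # step 2c: incomplete ordering
        z, paragraphs = [], []
        for k in pi_prime:                          # step 2d
            span = 1 + rng.poisson(lam)             # i: span length (shifted here
                                                    #    so sections are non-empty)
            for _ in range(span):
                z.append(k)                         # ii: paragraph topic label
                paragraphs.append(rng.choice(V, size=words_per_par, p=phi[k]))
        corpus.append({"ordering": pi_prime, "z": z, "paragraphs": paragraphs})
    return corpus
```

For example, generate_corpus(D=5, K=10, V=1000, alpha=10.0, beta=0.1, lam=1.0, rho=1.0) returns five synthetic documents, each with its incomplete topic ordering, per-paragraph topic labels z, and bags of word ids.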
Open Source Code: No
The source code for Bayesseg and GM was downloaded from http://groups.csail.mit.edu/rbg/code/. The source code for STSM was downloaded from http://web.science.mq.edu.au/~ldu/code.html. (This refers only to the baseline models; the authors' own TMIO code is not released.)
Open Datasets: Yes
We use the following sets of corpora. The first set contains four corpora (Chen et al. 2009) whose documents are assumed to exhibit the ordering regularity. Specifically, Wikicities English and Wikicities French contain Wikipedia articles about the world's 100 largest cities by population, in English and French respectively; Wikielements contains 118 English Wikipedia articles about chemical elements; and Cellphones contains 100 cellphone reviews. ... The second set consists of four of Choi's data sets (Choi 2000)... The final Clinical corpus (Eisenstein and Barzilay 2008) contains 227 documents, each of which is a chapter of a medical textbook.
Dataset Splits: No
The paper does not explicitly provide the training/validation/test splits needed to reproduce the experiments with their model.
Hardware Specification: No
The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies: No
The paper mentions parameter settings such as β = 0.1 and a Poisson parameter of 1.0, and names models such as Bayesseg and LDASeg, but does not provide version numbers for any software dependencies.
Experiment Setup: Yes
TMIO: the GMM parameters were exactly the same as in the GM. We used a symmetric Dirichlet prior in the Dirichlet-multinomial model, i.e., β = 0.1. We set the parameter of the Poisson distribution to 1.0 (unless otherwise stated), and set α in the FFM to the number of topics. Results are the average of the samples drawn from 10 runs.
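For quick reference, these reported settings can be collected into a small configuration block. This is a hypothetical snippet; the key names are invented here and do not come from any released code.

```python
# Hypothetical configuration mirroring the reported TMIO settings.
# Key names are illustrative, not from any released implementation.
TMIO_CONFIG = {
    "beta": 0.1,                 # symmetric Dirichlet prior (Dirichlet-multinomial model)
    "poisson_lambda": 1.0,       # topic-span Poisson parameter (unless otherwise stated)
    "alpha": "K",                # FFM parameter, set at runtime to the number of topics K
    "gmm_params": "same as GM",  # GMM parameters taken unchanged from the GM baseline
    "num_runs": 10,              # results averaged over samples drawn from 10 runs
}
```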