Topic Segmentation with an Ordering-Based Topic Model

Authors: Lan Du, John Pate, Mark Johnson

Venue: AAAI 2015

Each reproducibility variable below is listed as "Variable: Result", followed by the LLM response quoting or summarizing the paper.
Research Type: Experimental
Experimental results show that our model performs significantly better than the other ordering-based topic models on nearly all the corpora that we used, and competitively with other state-of-the-art topic segmentation models on corpora that have a strong ordering regularity. In this section we compare our TMIO model to four state-of-the-art topic segmentation models, one ordering-based and three with no sense of ordering, in two text analysis tasks (topic segmentation and cross-document alignment) with two different kinds of documents.
Researcher Affiliation: Academia
Lan Du, John K. Pate and Mark Johnson, Department of Computing, Macquarie University, Sydney, NSW 2109, Australia. {Lan.Du, John.Pate, Mark.Johnson}@mq.edu.au
Pseudocode: Yes
Given a set of topics K and a set of documents D = {1, 2, 3, ..., D}, the full generative process can be read off from the above components as the following:
1. For each topic k ∈ {1, ..., K}:
   (a) Draw word distribution φ_k ~ Dirichlet_V(β).
   (b) Draw the Bernoulli parameter in the FFM, μ_k ~ Beta(α/K, 1).
2. For each document d ∈ {1, ..., D}:
   (a) Draw a full ordering π_d from the GMM with Eq. (1).
   (b) For each topic k, draw b_{d,k} ~ Bernoulli(μ_k).
   (c) Generate the incomplete ordering π′_d from π_d and b_d, i.e., π_d restricted to the topics with b_{d,k} = 1.
   (d) For each section k ∈ π′_d:
      i. Draw the topic span l_{d,k} ~ Poisson(λ).
      ii. For each paragraph j in the span, let z_{d,j} = k and generate a set of words w_{d,j} from Discrete(φ_k).
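To make the quoted generative process concrete, here is a minimal Python sketch of it. This is not the authors' implementation (no TMIO code was released): the repeated-insertion sampler standing in for the paper's GMM, the dispersion parameter rho, the shifted Poisson span (so every section is non-empty), the fixed paragraph length, and all function names are assumptions made for illustration.

```python
import numpy as np

def sample_gmm_ordering(K, rho, rng):
    # Repeated-insertion sampler for a Mallows-style ordering model:
    # item j is inserted v_j places from the end of the current prefix,
    # with P(v_j = v) proportional to exp(-rho * v), v in {0, ..., j}.
    # rho > 0 concentrates orderings around the canonical order 0..K-1.
    ordering = [0]
    for j in range(1, K):
        weights = np.exp(-rho * np.arange(j + 1))
        v = rng.choice(j + 1, p=weights / weights.sum())
        ordering.insert(len(ordering) - v, j)
    return ordering

def generate_corpus(D, K, V, alpha, beta, lam, rho, words_per_par=20, seed=0):
    rng = np.random.default_rng(seed)
    phi = rng.dirichlet(np.full(V, beta), size=K)   # step 1a: topic-word dists
    mu = rng.beta(alpha / K, 1.0, size=K)           # step 1b: FFM usage probs
    corpus = []
    for _ in range(D):
        pi = sample_gmm_ordering(K, rho, rng)       # step 2a: full ordering
        b = rng.random(K) < mu                      # step 2b: topic selection
        pi_prime = [k for k in pi if b[k]]          # step 2c: incomplete ordering
        z, paragraphs = [], []
        for k in pi_prime:                          # step 2d
            span = 1 + rng.poisson(lam)             # i: span length (shifted here
                                                    #    so sections are non-empty)
            for _ in range(span):
                z.append(k)                         # ii: paragraph topic label
                paragraphs.append(rng.choice(V, size=words_per_par, p=phi[k]))
        corpus.append({"ordering": pi_prime, "z": z, "paragraphs": paragraphs})
    return corpus
```

For example, generate_corpus(D=5, K=10, V=1000, alpha=10.0, beta=0.1, lam=1.0, rho=1.0) returns five synthetic documents, each with its incomplete topic ordering, per-paragraph topic labels z, and bags of word ids.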
Open Source Code: No
The source code for Bayesseg and GM was downloaded from http://groups.csail.mit.edu/rbg/code/. The source code for STSM was downloaded from http://web.science.mq.edu.au/~ldu/code.html. (This refers only to the baseline models; the authors' own TMIO code is not released.)
Open Datasets: Yes
We use the following sets of corpora. The first set contains four corpora (Chen et al. 2009) whose documents are assumed to exhibit the ordering regularity. Specifically, Wikicities English and Wikicities French contain Wikipedia articles about the world's 100 largest cities by population, in English and French respectively; Wikielements contains 118 English Wikipedia articles about chemical elements; and Cellphones contains 100 cellphone reviews. ... The second set consists of four of Choi's data sets (Choi 2000)... The final Clinical corpus (Eisenstein and Barzilay 2008) contains 227 documents, each of which is a chapter of a medical textbook.
Dataset Splits: No
The paper does not explicitly provide the training/validation/test splits needed to reproduce the experiments with their model.
Hardware Specification: No
The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies: No
The paper mentions parameter settings such as β = 0.1 and a Poisson parameter of 1.0, and names models such as Bayesseg and LDASeg, but does not provide version numbers for any software dependencies.
Experiment Setup: Yes
TMIO: the GMM parameters were exactly the same as in the GM. We used a symmetric Dirichlet prior in the Dirichlet-multinomial model, i.e., β = 0.1. We set the parameter of the Poisson distribution to 1.0 (unless otherwise stated), and set α in the FFM to the number of topics. Results are the average of the samples drawn from 10 runs.
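For quick reference, these reported settings can be collected into a small configuration block. This is a hypothetical snippet; the key names are invented here and do not come from any released code.

```python
# Hypothetical configuration mirroring the reported TMIO settings.
# Key names are illustrative, not from any released implementation.
TMIO_CONFIG = {
    "beta": 0.1,                 # symmetric Dirichlet prior (Dirichlet-multinomial model)
    "poisson_lambda": 1.0,       # topic-span Poisson parameter (unless otherwise stated)
    "alpha": "K",                # FFM parameter, set at runtime to the number of topics K
    "gmm_params": "same as GM",  # GMM parameters taken unchanged from the GM baseline
    "num_runs": 10,              # results averaged over samples drawn from 10 runs
}
```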