Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

CompareLDA: A Topic Model for Document Comparison

Authors: Maksim Tkachenko, Hady W. Lauw

Pages: 7112-7119

AAAI 2019 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	'Evaluation on several public datasets underscores the strengths of CompareLDA in modelling document comparisons.' and 'Our experimental objective is to validate the efficacy of CompareLDA in deriving topics that are well-aligned to document comparisons.'
Researcher Affiliation	Academia	Maksim Tkachenko, Hady W. Lauw School of Information Systems Singapore Management University EMAIL, EMAIL
Pseudocode	No	The paper describes the generative process and model fitting steps in paragraph form, but does not include a clearly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code	No	The paper only references the implementation used for a baseline model ('We used the following implementation: https://github.com/vietansegan/segan/'). It does not provide any statement or link for the open-source code of their proposed method, CompareLDA.
Open Datasets	Yes	For experiments, we rely on public text corpora... Wikipedia: The first is a set of three datasets constructed from Wikipedia pages... The second dataset is from Amazon as described in (McAuley, Pandey, and Leskovec 2015; McAuley et al. 2015). ... The third dataset contains movie reviews (Pang and Lee 2005).
Dataset Splits	No	Each dataset is split into training and testing folds in 80:20 proportion respectively. The paper does not explicitly state a separate validation split.
Hardware Specification	No	The paper does not provide any specific hardware details (such as GPU or CPU models, or cloud computing specifications) used for running its experiments.
Software Dependencies	No	The paper mentions using a specific implementation for a baseline model ('https://github.com/vietansegan/segan/') but does not provide any specific version numbers for software dependencies or libraries used in their own methodology or experiments.
Experiment Setup	No	The paper describes the model fitting procedure and mentions varying the number of topics (default is 80), but it does not specify concrete experimental setup details such as learning rates, batch sizes, or optimizer settings typically required for full reproducibility.
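
The Dataset Splits row records an 80:20 train/test proportion but no further detail on how the folds were drawn. As an illustration of what a fully specified split might look like, here is a minimal sketch of a seeded 80:20 split. It is not the authors' procedure: the function name `split_80_20`, the seed value, and the placeholder document names are all assumptions made for this example.

```python
import random


def split_80_20(items, seed=0, train_fraction=0.8):
    """Hypothetical helper: shuffle a copy of `items` with a fixed seed,
    then cut it into training and testing folds."""
    shuffled = list(items)                  # copy, so the input is not mutated
    random.Random(seed).shuffle(shuffled)   # fixed seed makes the shuffle repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder corpus; a real run would load the actual documents instead.
documents = [f"doc_{i}" for i in range(10)]
train, test = split_80_20(documents, seed=0)
print(len(train), len(test))
```

Reporting the seed alongside the proportion would allow the same folds to be regenerated, which is the kind of detail the Dataset Splits and Experiment Setup variables look for.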