Incorporating Knowledge Graph Embeddings into Topic Modeling
Authors: Liang Yao, Yin Zhang, Baogang Wei, Zhe Jin, Rui Zhang, Yangyang Zhang, Qinfei Chen
AAAI 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation results will demonstrate the effectiveness of our method. Experimental results on three widely used datasets demonstrate that our method outperforms several state-of-the-art knowledge-based topic models and entity topic models on two tasks. |
| Researcher Affiliation | Academia | Liang Yao, Yin Zhang,* Baogang Wei, Zhe Jin, Rui Zhang, Yangyang Zhang, Qinfei Chen College of Computer Science and Technology Zhejiang University, Hangzhou, China {yaoliang, yinzh, wbg, shrineshine, iamzhangrui, zhyy, chenqinfei}@zju.edu.cn |
| Pseudocode | Yes | The generative process of KGE-LDA(a) is given as: 1. For each document d, draw θd ∼ Dir(α). 2. For each topic k in 1…K: (a) Draw φk ∼ Dir(β). (b) Draw μk ∼ vMF(μ0, C0). (c) Draw κk ∼ logNormal(m, σ²). 3. For each of the Nwd words in document d: (a) Draw a topic zdn ∼ Mult(θd). (b) Draw a word wdn ∼ Mult(φzdn). 4. For each of the Ned entities in document d: (a) Draw a topic z′dm ∼ Mult(θd). (b) Draw an entity embedding edm ∼ vMF(μz′dm, κz′dm). |
| Open Source Code | Yes | We released the implementation of this paper at the first author's GitHub: https://github.com/yao8839836/KGE-LDA. |
| Open Datasets | Yes | We run our experiments on three widely used datasets: 20-Newsgroups (20NG), NIPS and the Ohsumed corpus. The 20NG dataset4 ("bydate" version) contains 18,846 documents evenly categorized into 20 different categories. 11,314 documents are in the training set and 7,532 documents are in the test set. The NIPS dataset5 contains 1,740 papers from the NIPS conference. The Ohsumed corpus is from the MEDLINE database... In this study, we consider the 13,929 unique Cardiovascular diseases abstracts... 3,357 documents are in the training set and 4,043 documents are in the test set. (Footnote 4: http://qwone.com/~jason/20Newsgroups/, Footnote 5: http://www.cs.nyu.edu/~roweis/data.html, Footnote 6: http://disi.unitn.it/moschitti/corpora.htm) |
| Dataset Splits | Yes | The 20NG dataset4 ("bydate" version) contains 18,846 documents evenly categorized into 20 different categories. 11,314 documents are in the training set and 7,532 documents are in the test set. The Ohsumed corpus... 3,357 documents are in the training set and 4,043 documents are in the test set. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for experiments were mentioned. |
| Software Dependencies | No | No specific software versions (e.g., programming language, libraries, frameworks) were mentioned. |
| Experiment Setup | Yes | For all the methods in comparison, we set the hyperparameters as α = 50/K, β = 0.01, a commonly used setting which has often been employed in prior work (Steyvers and Griffiths 2007). For KGE-LDA, we initialize each dimension of μ0 with a Gaussian distribution N(0, 1) and then normalize μ0 into a unit norm vector, we set C0 = 0.01, m = 0.01, σ = 0.25... We set other parameters as the recommended settings in baseline papers, i.e., entity (concept) topic hyperparameter β = 0.01 for CI-LDA, Corr-LDA and CTM, λ = 0.6 for LF-LDA and λ = 2000, ε = 0.07 for GK-LDA. All models are trained using 1000 Gibbs sampling iterations. The only exception is that 1200 iterations (1000 initial iterations with LDA model + 200 iterations with LF-LDA) are run for LF-LDA. |
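The generative process quoted in the Pseudocode row, combined with the hyperparameters from the Experiment Setup row (α = 50/K, β = 0.01, C0 = 0.01, m = 0.01, σ = 0.25, unit-norm Gaussian-initialized μ0), can be sketched in NumPy. This is an illustrative sketch, not the authors' released implementation: the topic/vocabulary/document sizes are toy assumptions, and the von Mises-Fisher draws use Wood's (1994) rejection sampler rather than any code from the linked repository.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_vmf(mu, kappa, rng):
    """One draw from vMF(mu, kappa); mu must be unit-norm.
    Rejection sampler following Wood (1994)."""
    p = len(mu)
    b = (-2.0 * kappa + np.sqrt(4.0 * kappa**2 + (p - 1) ** 2)) / (p - 1)
    x0 = (1.0 - b) / (1.0 + b)
    c = kappa * x0 + (p - 1) * np.log(1.0 - x0**2)
    while True:  # sample the component w along mu
        z = rng.beta((p - 1) / 2.0, (p - 1) / 2.0)
        w = (1.0 - (1.0 + b) * z) / (1.0 - (1.0 - b) * z)
        if kappa * w + (p - 1) * np.log(1.0 - x0 * w) - c >= np.log(rng.uniform()):
            break
    v = rng.normal(size=p)        # random direction orthogonal to mu
    v -= v.dot(mu) * mu
    v /= np.linalg.norm(v)
    return w * mu + np.sqrt(1.0 - w**2) * v

# Toy sizes (illustrative assumptions, not from the paper)
K, V, D = 3, 50, 2            # topics, vocabulary, documents
E_DIM, N_WD, N_ED = 10, 8, 2  # embedding dim, words/doc, entities/doc

# Hyperparameters as reported in the Experiment Setup row
alpha, beta = 50.0 / K, 0.01
C0, m, sigma = 0.01, 0.01, 0.25

# mu0: each dimension ~ N(0, 1), then normalized to a unit-norm vector
mu0 = rng.normal(0.0, 1.0, E_DIM)
mu0 /= np.linalg.norm(mu0)

# Topic-level draws (step 2 of the generative process)
phi = rng.dirichlet([beta] * V, size=K)                  # phi_k ~ Dir(beta)
mu_k = np.array([sample_vmf(mu0, C0, rng) for _ in range(K)])
kappa_k = rng.lognormal(m, sigma, size=K)                # kappa_k ~ logNormal(m, sigma^2)

# Document-level draws (steps 1, 3, 4)
docs = []
for _ in range(D):
    theta = rng.dirichlet([alpha] * K)                   # theta_d ~ Dir(alpha)
    z_w = rng.choice(K, size=N_WD, p=theta)              # topic per word
    words = [rng.choice(V, p=phi[k]) for k in z_w]       # word ~ Mult(phi_z)
    z_e = rng.choice(K, size=N_ED, p=theta)              # topic per entity
    ents = [sample_vmf(mu_k[k], kappa_k[k], rng) for k in z_e]
    docs.append((words, ents))
```

Since the vMF distribution lives on the unit hypersphere, every sampled entity embedding comes out unit-norm, which is why μ0 is normalized before use.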