Incorporating Knowledge Graph Embeddings into Topic Modeling
Authors: Liang Yao, Yin Zhang, Baogang Wei, Zhe Jin, Rui Zhang, Yangyang Zhang, Qinfei Chen
AAAI 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation results will demonstrate the effectiveness of our method. Experimental results on three widely used datasets demonstrate that our method outperforms several state-of-the-art knowledge-based topic models and entity topic models on two tasks. |
| Researcher Affiliation | Academia | Liang Yao, Yin Zhang,* Baogang Wei, Zhe Jin, Rui Zhang, Yangyang Zhang, Qinfei Chen College of Computer Science and Technology Zhejiang University, Hangzhou, China {yaoliang, yinzh, wbg, shrineshine, iamzhangrui, zhyy, chenqinfei}@zju.edu.cn |
| Pseudocode | Yes | The generative process of KGE-LDA(a) is given as: 1. For each document d, draw θd ∼ Dir(α). 2. For each topic k in 1…K: (a) Draw φk ∼ Dir(β). (b) Draw μk ∼ vMF(μ0, C0). (c) Draw κk ∼ logNormal(m, σ²). 3. For each of the Nwd words in document d: (a) Draw a topic zdn ∼ Mult(θd). (b) Draw a word wdn ∼ Mult(φzdn). 4. For each of the Ned entities in document d: (a) Draw a topic z′dm ∼ Mult(θd). (b) Draw an entity embedding edm ∼ vMF(μz′dm, κz′dm). |
| Open Source Code | Yes | We released the implementation of this paper at the first author's GitHub: https://github.com/yao8839836/KGE-LDA. |
| Open Datasets | Yes | We run our experiments on three widely used datasets: 20-Newsgroups (20NG), NIPS and the Ohsumed corpus. The 20NG dataset4 ("bydate" version) contains 18,846 documents evenly categorized into 20 different categories. 11,314 documents are in the training set and 7,532 documents are in the test set. The NIPS dataset5 contains 1,740 papers from the NIPS conference. The Ohsumed corpus is from the MEDLINE database... In this study, we consider the 13,929 unique Cardiovascular diseases abstracts... 3,357 documents are in the training set and 4,043 documents are in the test set. (Footnote 4: http://qwone.com/~jason/20Newsgroups/, Footnote 5: http://www.cs.nyu.edu/~roweis/data.html, Footnote 6: http://disi.unitn.it/moschitti/corpora.htm) |
| Dataset Splits | Yes | The 20NG dataset4 ("bydate" version) contains 18,846 documents evenly categorized into 20 different categories. 11,314 documents are in the training set and 7,532 documents are in the test set. The Ohsumed corpus... 3,357 documents are in the training set and 4,043 documents are in the test set. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for experiments were mentioned. |
| Software Dependencies | No | No specific software versions (e.g., programming language, libraries, frameworks) were mentioned. |
| Experiment Setup | Yes | For all the methods in comparison, we set the hyperparameters as α = 50/K, β = 0.01, a commonly used setting which has often been employed in prior work (Steyvers and Griffiths 2007). For KGE-LDA, we initialize each dimension of μ0 with a Gaussian distribution N(0, 1) and then normalize μ0 into a unit norm vector, we set C0 = 0.01, m = 0.01, σ = 0.25... We set other parameters as the recommended settings in baseline papers, i.e., entity (concept) topic hyperparameter β = 0.01 for CI-LDA, Corr-LDA and CTM, λ = 0.6 for LF-LDA and λ = 2000, ε = 0.07 for GK-LDA. All models are trained using 1000 Gibbs sampling iterations. The only exception is that 1200 iterations (1000 initial iterations with LDA model + 200 iterations with LF-LDA) are run for LF-LDA. |
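The generative process quoted in the Pseudocode row, combined with the hyperparameters from the Experiment Setup row (α = 50/K, β = 0.01, C0 = 0.01, m = 0.01, σ = 0.25, unit-norm Gaussian-initialized μ0), can be sketched in NumPy. This is an illustrative sketch, not the authors' released implementation: the topic/vocabulary/document sizes are toy assumptions, and the von Mises-Fisher draws use Wood's (1994) rejection sampler rather than any code from the linked repository.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_vmf(mu, kappa, rng):
    """One draw from vMF(mu, kappa); mu must be unit-norm.
    Rejection sampler following Wood (1994)."""
    p = len(mu)
    b = (-2.0 * kappa + np.sqrt(4.0 * kappa**2 + (p - 1) ** 2)) / (p - 1)
    x0 = (1.0 - b) / (1.0 + b)
    c = kappa * x0 + (p - 1) * np.log(1.0 - x0**2)
    while True:  # sample the component w along mu
        z = rng.beta((p - 1) / 2.0, (p - 1) / 2.0)
        w = (1.0 - (1.0 + b) * z) / (1.0 - (1.0 - b) * z)
        if kappa * w + (p - 1) * np.log(1.0 - x0 * w) - c >= np.log(rng.uniform()):
            break
    v = rng.normal(size=p)        # random direction orthogonal to mu
    v -= v.dot(mu) * mu
    v /= np.linalg.norm(v)
    return w * mu + np.sqrt(1.0 - w**2) * v

# Toy sizes (illustrative assumptions, not from the paper)
K, V, D = 3, 50, 2            # topics, vocabulary, documents
E_DIM, N_WD, N_ED = 10, 8, 2  # embedding dim, words/doc, entities/doc

# Hyperparameters as reported in the Experiment Setup row
alpha, beta = 50.0 / K, 0.01
C0, m, sigma = 0.01, 0.01, 0.25

# mu0: each dimension ~ N(0, 1), then normalized to a unit-norm vector
mu0 = rng.normal(0.0, 1.0, E_DIM)
mu0 /= np.linalg.norm(mu0)

# Topic-level draws (step 2 of the generative process)
phi = rng.dirichlet([beta] * V, size=K)                  # phi_k ~ Dir(beta)
mu_k = np.array([sample_vmf(mu0, C0, rng) for _ in range(K)])
kappa_k = rng.lognormal(m, sigma, size=K)                # kappa_k ~ logNormal(m, sigma^2)

# Document-level draws (steps 1, 3, 4)
docs = []
for _ in range(D):
    theta = rng.dirichlet([alpha] * K)                   # theta_d ~ Dir(alpha)
    z_w = rng.choice(K, size=N_WD, p=theta)              # topic per word
    words = [rng.choice(V, p=phi[k]) for k in z_w]       # word ~ Mult(phi_z)
    z_e = rng.choice(K, size=N_ED, p=theta)              # topic per entity
    ents = [sample_vmf(mu_k[k], kappa_k[k], rng) for k in z_e]
    docs.append((words, ents))
```

Since the vMF distribution lives on the unit hypersphere, every sampled entity embedding comes out unit-norm, which is why μ0 is normalized before use.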