A Reduction for Efficient LDA Topic Reconstruction
Authors: Matteo Almanza, Flavio Chierichetti, Alessandro Panconesi, Andrea Vattani
NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a novel approach for LDA (Latent Dirichlet Allocation) topic reconstruction. The main technical idea is to show that the distribution over the documents generated by LDA can be transformed into a distribution for a much simpler generative model in which documents are generated from the same set of topics but have a much simpler structure: documents are single-topic and topics are chosen uniformly at random. Furthermore, this reduction is approximation preserving, in the sense that approximate distributions (the only ones we can hope to compute in practice) are mapped into approximate distributions in the simplified world. This opens up the possibility of efficiently reconstructing LDA topics in a roundabout way: compute an approximate document distribution from the given corpus, transform it into an approximate distribution for the single-topic world, and run a reconstruction algorithm in the uniform, single-topic world, a much simpler task than direct LDA reconstruction. We show the viability of the approach by giving very simple algorithms for a generalization of two notable cases that have been studied in the literature, p-separability and matrix-like topics. (A hedged sketch contrasting the two generative models appears after this table.) |
| Researcher Affiliation | Collaboration | Matteo Almanza (Sapienza University, Rome, Italy, almanza@di.uniroma1.it); Flavio Chierichetti (Sapienza University, Rome, Italy, flavio@di.uniroma1.it); Alessandro Panconesi (Sapienza University, Rome, Italy, ale@di.uniroma1.it); Andrea Vattani (Spiketrap, San Francisco, CA, USA, avattani@cs.ucsd.edu) |
| Pseudocode | Yes | Algorithm 1 The Algorithm for reconstructing (p, 1)-separable topics. Require: K, p > 0, δ, corpus C of documents, α parameter of the symmetric LDA mixture |
| Open Source Code | Yes | Our implementation can be downloaded from https://github.com/matteojug/lda-sta. |
| Open Datasets | Yes | The family NIPS TOPICS was generated by running Gibbs sampling on the NIPS dataset (Newman, 2008). ... Finally, a fourth family of topics was GRID topics. These are the prototypical grid-like topics of sizes 7 × 7 and 5 × 5 (introduced by Griffiths & Steyvers (2004)); notice that these are (p, 2)-separable but not (p, 1)-separable. |
| Dataset Splits | No | The paper mentions generating corpora of documents with varying sizes and lengths ('n = 10^4, 10^5, 10^6 and ℓ= 2, 3, 10, 100') but does not specify how these corpora are split into training, validation, or test sets, nor does it provide percentages or counts for such splits. |
| Hardware Specification | Yes | Each of these algorithms was executed on the same computer: an Intel Xeon CPU E5-2650 v4, 2.20GHz, with 64 GiB of DDR4 RAM. We used a single core per algorithm. |
| Software Dependencies | No | The paper mentions using the 'MALLET library (McCallum, 2002)' and the implementation 'Yau (2018)' of the tensor-based algorithm (henceforth TENSOR), but it does not specify concrete version numbers for these software components. It also states 'Our implementation can be downloaded from https://github.com/matteojug/lda-sta', but the specific versions of libraries used within that implementation are not detailed in the paper. |
| Experiment Setup | Yes | For the experiments we generated a family of k topics in various ways, for k = 10, 25, 50. The family NIPS TOPICS was generated by running Gibbs sampling on the NIPS dataset (Newman, 2008)... From each of the sets of topics specified above, we generated a corpus of n documents of length ℓ, for n = 10^4, 10^5, 10^6 and ℓ = 2, 3, 10, 100. Because of space constraints, we will only show results for n = 10^6. ... We use the popular MALLET library (McCallum, 2002), http://mallet.cs.umass.edu/, with a 200-iteration burn-in period and 1000 iterations. ... Algorithm 1 simply picks the first two words of a document and throws the rest away, seemingly a rather wasteful thing to do. (A hedged sketch of this first-two-words step follows the table.) |
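
To make the reduction quoted in the Research Type row concrete, here is a minimal Python sketch, not taken from the paper or its repository, contrasting the standard symmetric-LDA generative process with the simpler single-topic, uniform-choice model that the reduction targets. The names `topics`, `alpha`, and `length` are illustrative placeholders; `topics` is assumed to be a k × vocabulary matrix whose rows are probability distributions over words.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_lda_document(topics, alpha, length):
    """Symmetric LDA: draw a topic mixture from Dirichlet(alpha,...,alpha), then one topic per word."""
    k, vocab = topics.shape
    theta = rng.dirichlet([alpha] * k)            # per-document topic mixture
    words = []
    for _ in range(length):
        z = rng.choice(k, p=theta)                # topic assignment for this word
        words.append(int(rng.choice(vocab, p=topics[z])))
    return words

def sample_single_topic_document(topics, length):
    """Simplified model after the reduction: one topic, chosen uniformly, generates the whole document."""
    k, vocab = topics.shape
    z = rng.integers(k)                           # uniform topic choice
    return [int(w) for w in rng.choice(vocab, size=length, p=topics[z])]
```

Under these assumptions, a synthetic corpus of the kind described in the Experiment Setup row (n documents of length ℓ) could be produced with, e.g., `[sample_lda_document(topics, alpha, 3) for _ in range(10**4)]`.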
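The Experiment Setup row also quotes the observation that Algorithm 1 keeps only the first two words of each document. The snippet below is a hedged illustration of that preprocessing step only; `first_two_word_pairs` is a helper name introduced here for illustration, and the paper's actual reconstruction procedure is not reproduced.

```python
from collections import Counter

def first_two_word_pairs(corpus):
    """Count (first word, second word) pairs; the rest of each document is ignored."""
    pairs = Counter()
    for doc in corpus:                  # each doc is a sequence of word ids
        if len(doc) >= 2:
            pairs[(doc[0], doc[1])] += 1
    return pairs
```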