Graph-Sparse LDA: A Topic Model with Structured Sparsity
Authors: Finale Doshi-Velez, Byron Wallace, Ryan Adams
AAAI 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that our Graph-Sparse LDA model finds interpretable, predictive topics on one toy example and two real-world examples from biomedical domains. In each case we compare our model with the state-of-the-art Bayesian nonparametric topic modeling approach LIDA (Archambeau, Lakshminarayanan, and Bouchard 2011). Figures 3a and 3b show the difference in the held-out test likelihoods for the final 50 samples over 20 independent instantiations of the toy problem. |
| Researcher Affiliation | Academia | Finale Doshi-Velez Harvard University Cambridge, MA 02138 finale@seas.harvard.edu Byron C Wallace University of Texas at Austin Austin, TX 78701 byron.wallace@utexas.edu Ryan Adams Harvard University Cambridge, MA 02138 rpa@seas.harvard.edu |
| Pseudocode | No | In the supplementary materials, we derive a blocked-Gibbs sampler for B, B, A, A, and P (as well as for adding and deleting topics). |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described, nor does it explicitly state that the code is available. |
| Open Datasets | Yes | Autism Spectrum Disorder (ASD) is a complex, heterogeneous disease that is often accompanied by many co-occurring conditions such as epilepsy and intellectual disability. We consider a set of 3804 patients with 3626 different diagnoses where the datum Xnw corresponds to the number of times patient n received diagnosis w during the first 15 years of life.2 Diagnoses are organized in a tree-structured hierarchy known as ICD-9CM (Bodenreider 2004). The National Library of Medicine maintains a controlled structured vocabulary of Medical Subject Headings (MeSH) (Lipscomb 2000). |
| Dataset Splits | No | A random 1% of each dataset was held out to compute predictive log-likelihoods. |
| Hardware Specification | No | The paper does not provide specific hardware details (such as exact GPU/CPU models or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions the models and algorithms used (e.g., LDA, LIDA, Gibbs sampler) but does not provide specific version numbers for any software dependencies or libraries required for replication. |
| Experiment Setup | Yes | We ran all samplers for 250 iterations. To reduce burn-in, the product AP was initialized using an LDA tensor decomposition (Anandkumar et al. 2012) and then factored into A and P using alternating minimization to find a sparse A that enforced the simplex and ontology constraints. |
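The initialization described in the Experiment Setup row — factoring a product AP back into factors A and P by alternating minimization under simplex constraints — can be sketched as below. This is a minimal illustrative version only: `factor_simplex` and its projected least-squares updates are assumptions for exposition, and the paper's actual procedure additionally enforces sparsity on A and the ontology constraints, which are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def project_simplex_rows(M):
    # Crude simplex projection: clip to positive, renormalize each row to sum to 1.
    M = np.clip(M, 1e-12, None)
    return M / M.sum(axis=1, keepdims=True)

def factor_simplex(M, k, iters=50):
    """Alternating minimization M ~= A @ P with each row of A and P on the simplex.

    A (n x k) plays the role of per-document topic weights, P (k x w) of
    topic-word distributions, matching the LDA convention in the paper.
    """
    n, w = M.shape
    A = project_simplex_rows(rng.random((n, k)))
    P = project_simplex_rows(rng.random((k, w)))
    for _ in range(iters):
        # Least-squares update for one factor (via pseudoinverse), then project.
        A = project_simplex_rows(M @ np.linalg.pinv(P))
        P = project_simplex_rows(np.linalg.pinv(A) @ M)
    return A, P
```

Each update solves an unconstrained least-squares subproblem and then projects back onto the simplex; a faithful implementation would instead solve the constrained subproblems directly and add the sparsity/ontology penalties.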