Conic Scan-and-Cover algorithms for nonparametric topic modeling
Authors: Mikhail Yurochkin, Aritra Guha, XuanLong Nguyen
NeurIPS 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose new algorithms for topic modeling when the number of topics is unknown. Our approach relies on an analysis of the concentration of mass and angular geometry of the topic simplex, a convex polytope constructed by taking the convex hull of vertices representing the latent topics. Our algorithms are shown in practice to have accuracy comparable to a Gibbs sampler in terms of topic estimation, which requires the number of topics be given. Moreover, they are one of the fastest among several state of the art parametric techniques. Statistical consistency of our estimator is established under some conditions. ... 5 Experimental results |
| Researcher Affiliation | Academia | Mikhail Yurochkin Department of Statistics University of Michigan moonfolk@umich.edu Aritra Guha Department of Statistics University of Michigan aritra@umich.edu XuanLong Nguyen Department of Statistics University of Michigan xuanlong@umich.edu |
| Pseudocode | Yes | Algorithm 1 Conic Scan-and-Cover (CoSAC) ... Algorithm 2 CoSAC for documents |
| Open Source Code | Yes | Code is available at https://github.com/moonfolk/Geometric-Topic-Modeling. |
| Open Datasets | No | The paper mentions "NYTimes news articles" and generating synthetic data, but does not provide concrete access information (specific link, DOI, formal citation with authors/year, or clear reference to established benchmark datasets with access details) for a publicly available or open dataset. |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. It only mentions general concepts like "training time". |
| Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | Remark: We found the choices ω = 0.6 and R to be the median of {‖p_1‖_2, . . . , ‖p_M‖_2} to be robust in practice and agreeing with our theoretical results. ... The choice of λ is governed by results of Prop. 4. For small α_k = 1/K, ∀k, λ ≈ P(Λ_c) ... and for an equilateral B we can choose d such that cos(d) = ... Our approximations were based on large K to get a sense of λ; we now make a conservative choice λ = 0.001 ... Next we compare CoSAC to per-iteration quality of the Gibbs sampler trained with 500 iterations for M = 1000 and M = 5000. |
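The quoted setup describes the paper's two key geometric tunables: the cone angle ω and the stopping radius R (the median of the point norms). As a rough illustration of the cone-scanning idea those parameters control, the sketch below is a minimal, hypothetical Python rendering, not the authors' released implementation: it repeatedly takes the farthest remaining point from the simplex center as a topic direction and removes all points falling inside that cone (cosine similarity above ω). The real CoSAC additionally uses R to decide when the remaining points are noise rather than new topics, which this sketch omits.

```python
import numpy as np

def conic_scan_and_cover(points, omega=0.6):
    """Hypothetical sketch of the conic scan idea.

    points : (M, V) array of document points on the topic simplex.
    omega  : cosine threshold defining the cone angle (paper uses 0.6).
    Returns an array of estimated topic points, one per detected cone.
    """
    center = points.mean(axis=0)
    remaining = points - center  # work with residuals around the center
    topics = []
    while len(remaining) > 0:
        norms = np.linalg.norm(remaining, axis=1)
        far = remaining[np.argmax(norms)]        # farthest point spans a cone
        direction = far / np.linalg.norm(far)
        topics.append(center + far)              # cone apex as topic estimate
        # remove every point inside the cone around this direction
        cos = remaining @ direction / np.maximum(norms, 1e-12)
        remaining = remaining[cos < omega]
    return np.array(topics)

# Toy usage: three well-separated "topic" clusters at the simplex vertices.
pts = np.repeat(np.eye(3), 10, axis=0)
estimated = conic_scan_and_cover(pts, omega=0.6)
```

On this toy input the scan finds exactly three cones, one per vertex, without being told the number of topics, which is the nonparametric behavior the paper's reproducibility quote is configuring with ω, R, and λ.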