Geometric Dirichlet Means Algorithm for topic inference

Authors: Mikhail Yurochkin, XuanLong Nguyen

NeurIPS 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The algorithm is evaluated with extensive experiments on simulated and real data.
Researcher Affiliation | Academia | Mikhail Yurochkin, Department of Statistics, University of Michigan, moonfolk@umich.edu; XuanLong Nguyen, Department of Statistics, University of Michigan, xuanlong@umich.edu
Pseudocode | Yes | Algorithm 1 Geometric Dirichlet Means (GDM)
Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the methodology described, nor does it include a link to a code repository.
Open Datasets | Yes | NIPS corpora analysis: We proceed with the analysis of the NIPS corpus (https://archive.ics.uci.edu/ml/datasets/Bag+of+Words). After preprocessing, there are 1738 documents and 4188 unique words. Length of documents ranges from 39 to 1403 with mean of 272. We consider K = 5, 10, 15, 20, α = 5/K, η = 0.1. For each value of K we set aside 300 documents chosen at random to compute the perplexity and average results over 3 repetitions. (A loading sketch for this corpus appears below the table.)
Dataset Splits | Yes | "The number of held-out documents is 100; results are averaged over 5 repetitions." and "For each value of K we set aside 300 documents chosen at random to compute the perplexity and average results over 3 repetitions."
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. It only mentions runtimes without specifying the hardware they were run on.
Software Dependencies | No | The paper mentions R and Python as programming languages and the Hartigan & Wong (1979) algorithm, but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | Unless otherwise specified, we set η = 0.1, α = 0.1, V = 1200, M = 1000, K = 5; N_m = 1000 for each m; the number of held-out documents is 100; results are averaged over 5 repetitions. Since finding exact solution to the k-means objective is NP hard, we use the algorithm of Hartigan & Wong (1979) with 10 restarts and the k-means++ initialization. (Sketches of the simulated-data setup and the clustering step appear below the table.)
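
The NIPS corpus quoted in the Open Datasets row is distributed through the UCI "Bag of Words" page. Below is a minimal loading sketch, assuming the standard layout of those files (three header lines giving the number of documents, vocabulary size, and number of nonzero entries, followed by "docID wordID count" triples with 1-based indices); the file name docword.nips.txt is an assumption about a local, un-gzipped copy, and the paper's own preprocessing down to 1738 documents and 4188 words is not reproduced.

```python
import numpy as np

def load_uci_bow(path):
    """Read a UCI Bag-of-Words docword file into a dense document-term count matrix."""
    with open(path) as f:
        n_docs = int(f.readline())
        n_words = int(f.readline())
        f.readline()  # number of nonzero entries; unused when building a dense matrix
        counts = np.zeros((n_docs, n_words))
        for line in f:
            d, w, c = map(int, line.split())
            counts[d - 1, w - 1] = c  # the file uses 1-based indices
    return counts

counts = load_uci_bow("docword.nips.txt")            # assumed local path
freqs = counts / counts.sum(axis=1, keepdims=True)   # per-document word frequencies
```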
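The Experiment Setup row lists the simulation hyperparameters, but no simulation code is released (Open Source Code: No). The sketch below draws a corpus with those settings under the standard LDA generative process; treating the simulated data as standard LDA draws is an assumption, not a statement from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
V, M, K, N_m = 1200, 1000, 5, 1000   # vocabulary size, documents, topics, words per document
eta, alpha = 0.1, 0.1                # topic-word and document-topic Dirichlet parameters

topics = rng.dirichlet(np.full(V, eta), size=K)    # K x V topic-word distributions
theta = rng.dirichlet(np.full(K, alpha), size=M)   # M x K document-topic proportions
doc_dists = theta @ topics                         # M x V per-document word distributions
# renormalize each row before sampling to guard against floating-point drift
counts = np.vstack([rng.multinomial(N_m, p / p.sum()) for p in doc_dists])  # M x V counts
```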
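The same row specifies the clustering configuration: k-means solved with the Hartigan & Wong (1979) algorithm, 10 restarts, and k-means++ initialization. The sketch below covers only that clustering backbone and the random held-out split, using scikit-learn, whose KMeans performs Lloyd-style updates rather than Hartigan & Wong (the latter is the default of R's kmeans); the geometric centroid-extension step of the paper's Algorithm 1 (GDM) is replaced here by a clip-and-renormalize placeholder, so this is not a faithful transcription of the algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
K, n_heldout = 5, 100   # 100 held-out documents for simulations; 300 for the NIPS corpus

# toy stand-in document-term counts; substitute the matrices from the sketches above
counts = rng.integers(1, 5, size=(1000, 1200))
freqs = counts / counts.sum(axis=1, keepdims=True)

# hold out documents chosen at random
perm = rng.permutation(len(freqs))
train, heldout = freqs[perm[n_heldout:]], freqs[perm[:n_heldout]]

# k-means with k-means++ initialization and 10 restarts (Lloyd-style updates)
km = KMeans(n_clusters=K, init="k-means++", n_init=10).fit(train)

# crude placeholder for the geometric extension of centroids in Algorithm 1
beta_hat = np.clip(km.cluster_centers_, 0.0, None)
beta_hat /= beta_hat.sum(axis=1, keepdims=True)    # rows: estimated topic distributions
```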