Automatic Variational Inference in Stan

Authors: Alp Kucukelbir, Rajesh Ranganath, Andrew Gelman, David Blei

Venue: NeurIPS 2015

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "We compare ADVI to MCMC sampling across hierarchical generalized linear models, nonconjugate matrix factorization, and a mixture model. We train the mixture model on a quarter million images." |
| Researcher Affiliation | Academia | Alp Kucukelbir (Columbia University, alp@cs.columbia.edu); Rajesh Ranganath (Princeton University, rajeshr@cs.princeton.edu); Andrew Gelman (Columbia University, gelman@stat.columbia.edu); David M. Blei (Columbia University, david.blei@columbia.edu) |
| Pseudocode | Yes | "Algorithm 1: Automatic differentiation variational inference (ADVI)" |
| Open Source Code | Yes | "We propose an automatic variational inference algorithm, automatic differentiation variational inference (ADVI); we implement it in Stan (code available), a probabilistic programming system." (See the usage sketch below the table.) |
| Open Datasets | Yes | "Here, we show how easy it is to explore new models using ADVI. In both models, we use the Frey Face dataset, which contains 1956 frames (28 × 20 pixels) of facial expressions extracted from a video sequence. We explore the ImageCLEF dataset, which has 250,000 images [25]." |
| Dataset Splits | No | The paper mentions training sets and held-out test sets (e.g., "We use 10,000 training samples and hold out 1,000 for testing"), but it does not explicitly define or use a separate validation split. |
| Hardware Specification | No | The paper does not specify the hardware (e.g., CPU or GPU model, memory) used to run the experiments. |
| Software Dependencies | Yes | "ADVI is available in Stan 2.8. See Appendix C." |
| Experiment Setup | Yes | "We approximate the posterior predictive likelihood using an MC estimate. For MCMC, we plug in posterior samples. For ADVI, we draw samples from the posterior approximation during the optimization. We initialize ADVI with a draw from a standard Gaussian. We study ADVI with two settings of M, the number of MC samples used to estimate gradients. A single sample per iteration is sufficient; it is also the fastest. (We set M = 1 from here on.)" (See the gradient sketch below the table.) |
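
To make the "Open Source Code" and "Software Dependencies" rows concrete, here is a minimal sketch of invoking Stan's ADVI through CmdStanPy. The toy model, data, and file name are assumptions for illustration only; the paper shipped ADVI with Stan 2.8, before CmdStanPy existed, so this just demonstrates the present-day call pattern.

```python
# Minimal sketch (assumption: CmdStanPy and CmdStan are installed) of
# running Stan's meanfield ADVI, the algorithm described in the paper.
from cmdstanpy import CmdStanModel

# Toy Stan program standing in for the paper's models (hypothetical).
stan_code = """
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;  // ADVI transforms this to the unconstrained space
}
model {
  mu ~ normal(0, 10);
  sigma ~ lognormal(0, 1);
  y ~ normal(mu, sigma);
}
"""
with open("toy.stan", "w") as f:
    f.write(stan_code)

model = CmdStanModel(stan_file="toy.stan")
fit = model.variational(
    data={"N": 5, "y": [0.1, -0.4, 1.2, 0.3, -0.7]},
    algorithm="meanfield",  # fully factorized Gaussian approximation
    grad_samples=1,         # M = 1 MC sample per gradient, as quoted above
    seed=1,
)
print(fit.variational_params_dict)
```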
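And to unpack the "Pseudocode" and "Experiment Setup" rows, the sketch below implements the core step of a mean-field ADVI iteration with M = 1: a single reparameterized draw gives a stochastic gradient of the ELBO for a Gaussian approximation in the unconstrained space. The stand-in target density and the fixed step size are assumptions; Algorithm 1 in the paper uses Stan's automatic differentiation and an adaptive step-size sequence.

```python
# Minimal sketch of the mean-field ADVI gradient step with M = 1
# (a single reparameterized Monte Carlo sample per iteration), as in
# the experiment setup quoted above. The target is a stand-in: an
# isotropic Gaussian posterior whose gradient we know in closed form.
import numpy as np

rng = np.random.default_rng(0)
D = 2

def grad_log_p(zeta):
    # Stand-in target (assumption): log p(zeta) = -0.5 * ||zeta - 3||^2.
    return -(zeta - 3.0)

# Variational parameters of q(zeta) = N(mu, diag(exp(omega))^2),
# initialized from a standard Gaussian draw, echoing the paper's setup.
mu = rng.standard_normal(D)
omega = np.zeros(D)  # log standard deviations

step = 0.05  # fixed step size; the paper uses an adaptive sequence
for _ in range(2000):
    eta = rng.standard_normal(D)     # single MC sample (M = 1)
    zeta = mu + np.exp(omega) * eta  # reparameterization trick
    g = grad_log_p(zeta)
    mu += step * g                   # stochastic grad of ELBO w.r.t. mu
    omega += step * (g * eta * np.exp(omega) + 1.0)  # + entropy gradient

print("mu  ->", mu)             # approaches the target mean (3, 3)
print("sd  ->", np.exp(omega))  # approaches the target sd (1, 1)
```

Note that the Gaussian entropy enters the ELBO analytically (its gradient with respect to each log standard deviation is exactly 1), so the `+ 1.0` term is exact rather than sampled; only the model term is noisy, which is consistent with the quoted finding that a single sample per iteration is sufficient.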