The Generalized Reparameterization Gradient
Authors: Francisco J. R. Ruiz, Michalis K. Titsias, David M. Blei
NeurIPS 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate our approach on variational inference for two complex probabilistic models. The generalized reparameterization is effective: even a single sample from the variational distribution is enough to obtain a low-variance gradient. ... We apply G-REP to perform mean-field VI on two nonconjugate probabilistic models: the sparse gamma deep exponential family (DEF) and a beta-gamma matrix factorization (MF) model. ... We apply the sparse gamma DEF on two different databases: (i) the Olivetti database at AT&T... and (ii) the collection of papers at the Neural Information Processing Systems (NIPS) 2011 conference... We apply the beta-gamma MF on: (i) the binarized MNIST data... and (ii) the Omniglot dataset (Lake et al., 2015). ... We show in Figure 1 the evolution of the ELBO as a function of the running time for three of the considered datasets. |
| Researcher Affiliation | Academia | Francisco J. R. Ruiz (University of Cambridge; Columbia University), Michalis K. Titsias (Athens University of Economics and Business), David M. Blei (Columbia University) |
| Pseudocode | Yes | Algorithm 1: Generalized reparameterization gradient algorithm (a hedged sketch of the estimator's two-term structure follows the table) |
| Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for the methodology described. |
| Open Datasets | Yes | We apply the sparse gamma DEF on two different databases: (i) the Olivetti database at AT&T [6], which consists of 400 (320 for training and 80 for test) 64×64 images of human faces in an 8-bit scale (0–255); and (ii) the collection of papers at the Neural Information Processing Systems (NIPS) 2011 conference, which consists of 305 documents and a vocabulary of 5715 effective words in a bag-of-words format (25% of words from all documents are set aside to form the test set). ... We apply the beta-gamma MF on: (i) the binarized MNIST data [7], which consists of 28×28 images of hand-written digits (we use 5000 training and 2000 test images); and (ii) the Omniglot dataset (Lake et al., 2015), which consists of 105×105 images of hand-written characters from different alphabets (we select 10 alphabets, with 4425 training images, 1475 test images, and 295 characters). ... [6] http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html ... [7] http://yann.lecun.com/exdb/mnist |
| Dataset Splits | Yes | Olivetti database at AT&T [6], which consists of 400 (320 for training and 80 for test) ... NIPS 2011 conference, which consists of 305 documents... (25% of words from all documents are set aside to form the test set). ... binarized MNIST data [7], which consists of ... 5000 training and 2000 test images; and (ii) the Omniglot dataset (Lake et al., 2015), which consists of ... 4425 training images, 1475 test images |
| Hardware Specification | No | The paper does not provide specific hardware details. It only mentions "CPU time". |
| Software Dependencies | No | The paper mentions software like RMSProp, Adagrad, Stan, and automatic differentiation tools but does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | We use the adaptive step-size sequence proposed by Kucukelbir et al. (2016), which combines RMSProp (Tieleman and Hinton, 2012) and Adagrad (Duchi et al., 2011). Let $g_k^{(i)}$ be the k-th component of the gradient at the i-th iteration, and $\rho_k^{(i)}$ the step-size for that component. We set $\rho_k^{(i)} = \eta \cdot i^{-1/2+\epsilon} \cdot \big(\tau + \sqrt{s_k^{(i)}}\big)^{-1}$, with $s_k^{(i)} = \alpha \big(g_k^{(i)}\big)^2 + (1-\alpha)\, s_k^{(i-1)}$, where we set $\epsilon = 10^{-16}$, $\tau = 1$, $\alpha = 0.1$, and we explore several values of $\eta$. Thus, we update the variational parameters as $v^{(i+1)} = v^{(i)} + \rho^{(i)} \circ \nabla_v \mathcal{L}$, where $\circ$ is the element-wise product. ... To estimate the gradient, we use 30 Monte Carlo samples for BBVI, and only 1 for ADVI and G-REP. For BBVI, we use Rao-Blackwellization and control variates (we use a separate set of 30 samples to estimate the control variates). For BBVI and G-REP, we use beta and gamma variational distributions, whereas ADVI uses Gaussian distributions on the transformed space... We parameterize the gamma distribution in terms of its shape and mean, and the beta in terms of its shape parameters $\alpha$ and $\beta$. To avoid constrained optimization, we apply the transformation $v' = \log(\exp(v) - 1)$ to the variational parameters that are constrained to be positive and take stochastic gradient steps with respect to $v'$. We use the analytic gradient of the entropy terms. (A hedged code sketch of this schedule and transformation follows the table.) |
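Since the paper's code was not released, the following is a minimal sketch of the two-term G-REP estimator for a single gamma variational factor, using the log-space standardization the paper describes. All names (`T`, `log_q_eps`, `grep_grad`), the toy integrand, and the use of JAX autodiff are our illustrative assumptions, not the authors' implementation; the ELBO's entropy term, handled analytically in the paper, is omitted.

```python
import jax
import jax.numpy as jnp
from jax.scipy.special import digamma, polygamma, gammaln

def T(eps, alpha, beta):
    # Inverse standardization of log z for Gamma(shape=alpha, rate=beta):
    # log z has mean digamma(alpha) - log(beta) and variance polygamma(1, alpha).
    return jnp.exp(eps * jnp.sqrt(polygamma(1, alpha)) + digamma(alpha)) / beta

def log_q_z(z, alpha, beta):
    # Gamma(shape=alpha, rate=beta) log-density.
    return alpha * jnp.log(beta) - gammaln(alpha) + (alpha - 1.0) * jnp.log(z) - beta * z

def log_q_eps(eps, alpha, beta):
    # Density of eps = T^{-1}(z; v) by change of variables; it depends only
    # weakly on (alpha, beta), which is what keeps the correction term small.
    z = T(eps, alpha, beta)
    log_jac = jnp.log(jax.grad(T)(eps, alpha, beta))  # dz/deps > 0 here
    return log_q_z(z, alpha, beta) + log_jac

def grep_grad(f, eps, alpha, beta):
    # Single-sample estimate of d/d(alpha, beta) E_q[f(z)]:
    # g_rep differentiates f through z = T(eps; v) with eps held fixed;
    # the score term is the correction for the dependence of q(eps; v) on v.
    g_rep = jax.grad(lambda a, b: f(T(eps, a, b)), argnums=(0, 1))(alpha, beta)
    score = jax.grad(log_q_eps, argnums=(1, 2))(eps, alpha, beta)
    fz = f(T(eps, alpha, beta))
    return tuple(r + fz * s for r, s in zip(g_rep, score))

key = jax.random.PRNGKey(0)
alpha, beta = jnp.asarray(2.0), jnp.asarray(1.5)
z = jax.random.gamma(key, alpha) / beta                      # z ~ Gamma(alpha, beta)
eps = (jnp.log(z) - digamma(alpha) + jnp.log(beta)) / jnp.sqrt(polygamma(1, alpha))
print(grep_grad(lambda z: -0.5 * z**2, eps, alpha, beta))    # toy integrand f
```

The `fz * s` term plays the role of the paper's correction gradient: because the standardized ε depends only weakly on (α, β), it stays small, which is consistent with the paper's observation that one sample already gives a low-variance estimate.

The quoted adaptive step-size rule is likewise easy to restate in code. The sketch below assumes the stated hyperparameters (ε = 10⁻¹⁶, τ = 1, α = 0.1) and pairs the schedule with the positivity transformation v′ = log(exp(v) − 1); `adaptive_step` and the toy values are hypothetical, not from the paper.

```python
import jax.numpy as jnp

def adaptive_step(g, s_prev, i, eta, alpha=0.1, tau=1.0, eps=1e-16):
    # Kucukelbir et al. (2016) schedule combining RMSProp and Adagrad:
    #   s^(i)   = alpha * g^2 + (1 - alpha) * s^(i-1)        (element-wise)
    #   rho^(i) = eta * i^(-1/2 + eps) / (tau + sqrt(s^(i)))
    s = alpha * g**2 + (1.0 - alpha) * s_prev
    rho = eta * i ** (-0.5 + eps) / (tau + jnp.sqrt(s))
    return rho, s

# Positive variational parameters are optimized through v' = log(exp(v) - 1),
# i.e. v = softplus(v'), so the gradient steps stay unconstrained:
v = jnp.asarray(2.0)
v_prime = jnp.log(jnp.expm1(v))                 # to unconstrained space
# rho, s = adaptive_step(grad, s, i, eta=1.0)   # hypothetical gradient `grad`
# v_prime = v_prime + rho * grad                # ascent step, element-wise
# v = jnp.log1p(jnp.exp(v_prime))               # back to the positive orthant
```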