The Generalized Reparameterization Gradient
Authors: Francisco J. R. Ruiz, Michalis K. Titsias, David M. Blei
NeurIPS 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate our approach on variational inference for two complex probabilistic models. The generalized reparameterization is effective: even a single sample from the variational distribution is enough to obtain a low-variance gradient. ... We apply G-REP to perform mean-field VI on two nonconjugate probabilistic models: the sparse gamma deep exponential family (DEF) and a beta-gamma matrix factorization (MF) model. ... We apply the sparse gamma DEF on two different databases: (i) the Olivetti database at AT&T... and (ii) the collection of papers at the Neural Information Processing Systems (NIPS) 2011 conference... We apply the beta-gamma MF on: (i) the binarized MNIST data... and (ii) the Omniglot dataset (Lake et al., 2015). ... We show in Figure 1 the evolution of the ELBO as a function of the running time for three of the considered datasets. |
| Researcher Affiliation | Academia | Francisco J. R. Ruiz (University of Cambridge; Columbia University), Michalis K. Titsias (Athens University of Economics and Business), David M. Blei (Columbia University) |
| Pseudocode | Yes | Algorithm 1: Generalized reparameterization gradient algorithm (a hedged sketch of the estimator's two-term structure follows the table) |
| Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for the methodology described. |
| Open Datasets | Yes | We apply the sparse gamma DEF on two different databases: (i) the Olivetti database at AT&T [6], which consists of 400 (320 for training and 80 for test) 64×64 images of human faces in an 8-bit scale (0–255); and (ii) the collection of papers at the Neural Information Processing Systems (NIPS) 2011 conference, which consists of 305 documents and a vocabulary of 5715 effective words in a bag-of-words format (25% of words from all documents are set aside to form the test set). ... We apply the beta-gamma MF on: (i) the binarized MNIST data [7], which consists of 28×28 images of hand-written digits (we use 5000 training and 2000 test images); and (ii) the Omniglot dataset (Lake et al., 2015), which consists of 105×105 images of hand-written characters from different alphabets (we select 10 alphabets, with 4425 training images, 1475 test images, and 295 characters). ... [6] http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html ... [7] http://yann.lecun.com/exdb/mnist |
| Dataset Splits | Yes | Olivetti database at AT&T [6], which consists of 400 (320 for training and 80 for test) ... NIPS 2011 conference, which consists of 305 documents... (25% of words from all documents are set aside to form the test set). ... binarized MNIST data [7], which consists of ... 5000 training and 2000 test images; and (ii) the Omniglot dataset (Lake et al., 2015), which consists of ... 4425 training images, 1475 test images |
| Hardware Specification | No | The paper does not provide specific hardware details. It only mentions "CPU time". |
| Software Dependencies | No | The paper mentions software like RMSProp, Adagrad, Stan, and automatic differentiation tools but does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | We use the adaptive step-size sequence proposed by Kucukelbir et al. (2016), which combines RMSProp (Tieleman and Hinton, 2012) and Adagrad (Duchi et al., 2011). Let $g_k^{(i)}$ be the k-th component of the gradient at the i-th iteration, and $\rho_k^{(i)}$ the step-size for that component. We set $\rho_k^{(i)} = \eta \cdot i^{-1/2+\epsilon} \cdot \big(\tau + \sqrt{s_k^{(i)}}\big)^{-1}$, with $s_k^{(i)} = \alpha \big(g_k^{(i)}\big)^2 + (1-\alpha)\, s_k^{(i-1)}$, where we set $\epsilon = 10^{-16}$, $\tau = 1$, $\alpha = 0.1$, and we explore several values of $\eta$. Thus, we update the variational parameters as $v^{(i+1)} = v^{(i)} + \rho^{(i)} \circ \nabla_v \mathcal{L}$, where $\circ$ is the element-wise product. ... To estimate the gradient, we use 30 Monte Carlo samples for BBVI, and only 1 for ADVI and G-REP. For BBVI, we use Rao-Blackwellization and control variates (we use a separate set of 30 samples to estimate the control variates). For BBVI and G-REP, we use beta and gamma variational distributions, whereas ADVI uses Gaussian distributions on the transformed space... We parameterize the gamma distribution in terms of its shape and mean, and the beta in terms of its shape parameters $\alpha$ and $\beta$. To avoid constrained optimization, we apply the transformation $v' = \log(\exp(v) - 1)$ to the variational parameters that are constrained to be positive and take stochastic gradient steps with respect to $v'$. We use the analytic gradient of the entropy terms. (A hedged code sketch of this schedule and transformation follows the table.) |
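Since the paper's code was not released, the following is a minimal sketch of the two-term G-REP estimator for a single gamma variational factor, using the log-space standardization the paper describes. All names (`T`, `log_q_eps`, `grep_grad`), the toy integrand, and the use of JAX autodiff are our illustrative assumptions, not the authors' implementation; the ELBO's entropy term, handled analytically in the paper, is omitted.

```python
import jax
import jax.numpy as jnp
from jax.scipy.special import digamma, polygamma, gammaln

def T(eps, alpha, beta):
    # Inverse standardization of log z for Gamma(shape=alpha, rate=beta):
    # log z has mean digamma(alpha) - log(beta) and variance polygamma(1, alpha).
    return jnp.exp(eps * jnp.sqrt(polygamma(1, alpha)) + digamma(alpha)) / beta

def log_q_z(z, alpha, beta):
    # Gamma(shape=alpha, rate=beta) log-density.
    return alpha * jnp.log(beta) - gammaln(alpha) + (alpha - 1.0) * jnp.log(z) - beta * z

def log_q_eps(eps, alpha, beta):
    # Density of eps = T^{-1}(z; v) by change of variables; it depends only
    # weakly on (alpha, beta), which is what keeps the correction term small.
    z = T(eps, alpha, beta)
    log_jac = jnp.log(jax.grad(T)(eps, alpha, beta))  # dz/deps > 0 here
    return log_q_z(z, alpha, beta) + log_jac

def grep_grad(f, eps, alpha, beta):
    # Single-sample estimate of d/d(alpha, beta) E_q[f(z)]:
    # g_rep differentiates f through z = T(eps; v) with eps held fixed;
    # the score term is the correction for the dependence of q(eps; v) on v.
    g_rep = jax.grad(lambda a, b: f(T(eps, a, b)), argnums=(0, 1))(alpha, beta)
    score = jax.grad(log_q_eps, argnums=(1, 2))(eps, alpha, beta)
    fz = f(T(eps, alpha, beta))
    return tuple(r + fz * s for r, s in zip(g_rep, score))

key = jax.random.PRNGKey(0)
alpha, beta = jnp.asarray(2.0), jnp.asarray(1.5)
z = jax.random.gamma(key, alpha) / beta                      # z ~ Gamma(alpha, beta)
eps = (jnp.log(z) - digamma(alpha) + jnp.log(beta)) / jnp.sqrt(polygamma(1, alpha))
print(grep_grad(lambda z: -0.5 * z**2, eps, alpha, beta))    # toy integrand f
```

The `fz * s` term plays the role of the paper's correction gradient: because the standardized ε depends only weakly on (α, β), it stays small, which is consistent with the paper's observation that one sample already gives a low-variance estimate.

The quoted adaptive step-size rule is likewise easy to restate in code. The sketch below assumes the stated hyperparameters (ε = 10⁻¹⁶, τ = 1, α = 0.1) and pairs the schedule with the positivity transformation v′ = log(exp(v) − 1); `adaptive_step` and the toy values are hypothetical, not from the paper.

```python
import jax.numpy as jnp

def adaptive_step(g, s_prev, i, eta, alpha=0.1, tau=1.0, eps=1e-16):
    # Kucukelbir et al. (2016) schedule combining RMSProp and Adagrad:
    #   s^(i)   = alpha * g^2 + (1 - alpha) * s^(i-1)        (element-wise)
    #   rho^(i) = eta * i^(-1/2 + eps) / (tau + sqrt(s^(i)))
    s = alpha * g**2 + (1.0 - alpha) * s_prev
    rho = eta * i ** (-0.5 + eps) / (tau + jnp.sqrt(s))
    return rho, s

# Positive variational parameters are optimized through v' = log(exp(v) - 1),
# i.e. v = softplus(v'), so the gradient steps stay unconstrained:
v = jnp.asarray(2.0)
v_prime = jnp.log(jnp.expm1(v))                 # to unconstrained space
# rho, s = adaptive_step(grad, s, i, eta=1.0)   # hypothetical gradient `grad`
# v_prime = v_prime + rho * grad                # ascent step, element-wise
# v = jnp.log1p(jnp.exp(v_prime))               # back to the positive orthant
```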