Generalized Doubly Reparameterized Gradient Estimators

Authors: Matthias Bauer, Andriy Mnih

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section we empirically evaluate the hierarchical extension of DREGs and its generalization to GDREGs, and compare them to the naive IWAE gradient estimator (labelled as IWAE) as well as STL (Roeder et al., 2017). We evaluate the proposed DREGs and GDREGs estimators on several conditional and unconditional unsupervised learning problems and find that they outperform the regular IWAE estimator.
Researcher Affiliation | Industry | DeepMind, London, UK. Correspondence to: Matthias Bauer <msbauer@deepmind.com>, Andriy Mnih <andriy@deepmind.com>.
Pseudocode | Yes | We provide full derivations and a discussion of this special case in App. H as well as an example implementation in terms of (pseudo-)code in App. F. ... In Listing 1 we provide a commented example of how to implement the GDREGs estimator for the cross-entropy objective given in Eq. (21) using JAX. (An illustrative JAX sketch of the underlying doubly reparameterized surrogate is given after the table.)
Open Source Code | Yes | We provide full derivations and a discussion of this special case in App. H as well as an example implementation in terms of (pseudo-)code in App. F. ... In Listing 1 we provide a commented example of how to implement the GDREGs estimator for the cross-entropy objective given in Eq. (21) using JAX.
Open Datasets | Yes | In the remainder of this paper we consider image modelling tasks with VAEs on several standard benchmark datasets: MNIST (LeCun & Cortes, 2010), Omniglot (Lake et al., 2015), and Fashion MNIST (Xiao et al., 2017).
Dataset Splits | No | The paper states 'We split the data into training and test sets as in previous work' but does not give specific training/validation/test percentages, nor does it explicitly mention a validation split.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments are mentioned in the paper.
Software Dependencies | No | The paper mentions using automatic differentiation frameworks such as TensorFlow and JAX, but does not provide version numbers for these or any other libraries.
Experiment Setup | Yes | Unless stated otherwise, we train all models for 1000 epochs using the Adam optimizer (Kingma & Ba, 2015) with a default learning rate of 3 × 10^-4, a batch size of 64, and K = 64 importance samples; see App. G for details. Appendix G further states: 'We use the Adam optimizer with a learning rate of 3e-4, β1 = 0.9, β2 = 0.999, and ϵ = 1e-7. All latent spaces have 50 dimensions. Every conditional distribution in Eq. (23) is parameterized by an MLP with two hidden layers of 300 tanh units each.' (A configuration sketch of these settings follows the table.)
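The paper's own Listing 1 (App. F) is not reproduced here. As a rough illustration of the mechanism these estimators build on, below is a minimal JAX sketch of the standard doubly reparameterized (DREGs) surrogate for the inference-network gradient of the IWAE bound, which GDREGs generalizes; `encode`, `log_joint_fn`, and the diagonal-Gaussian form of q are assumptions, not the authors' code.

```python
# Minimal sketch of the DREGs surrogate (Tucker et al., 2019), assuming a
# diagonal-Gaussian q_phi(z|x); `encode` and `log_joint_fn` are hypothetical.
import jax
import jax.numpy as jnp
from jax.scipy.stats import norm


def dreg_surrogate(phi, theta, x, key, encode, log_joint_fn, K=64):
    """Surrogate whose gradient w.r.t. `phi` is the DREGs estimator."""
    mu, log_sigma = encode(phi, x)                 # parameters of q_phi(z|x)
    sigma = jnp.exp(log_sigma)
    eps = jax.random.normal(key, (K,) + mu.shape)
    z = mu + sigma * eps                           # reparameterized samples, (K, D)

    # Stop gradients through q's parameters so that phi only enters log w via z;
    # this removes the score-function term from the gradient.
    log_q = norm.logpdf(z, jax.lax.stop_gradient(mu),
                        jax.lax.stop_gradient(sigma)).sum(-1)
    log_w = log_joint_fn(theta, x, z) - log_q      # log importance weights, (K,)

    # Self-normalized weights, held fixed; squaring them gives the DREGs weighting.
    w_tilde = jax.lax.stop_gradient(jax.nn.softmax(log_w))
    return jnp.sum(w_tilde ** 2 * log_w)


# Usage (phi gradient only; the model parameters theta use the usual IWAE surrogate):
# phi_grad = jax.grad(dreg_surrogate)(phi, theta, x, key, encode, log_joint_fn)
```

Differentiating this surrogate with respect to the encoder parameters gives the squared-weight pathwise gradient of Tucker et al. (2019); the paper's GDREGs extends the same double-reparameterization idea beyond the inference network, e.g. to model-parameter gradients.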
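To make the quoted training configuration concrete, here is a minimal sketch of the reported hyperparameters; the use of optax for Adam and the specific MLP initializer are assumptions, since the paper does not release this code.

```python
# Training configuration quoted from the paper and its Appendix G; optax usage
# and the MLP initializer are assumptions for illustration.
import jax
import jax.numpy as jnp
import optax

NUM_EPOCHS = 1000
BATCH_SIZE = 64
NUM_IMPORTANCE_SAMPLES = 64   # K = 64
LATENT_DIM = 50               # "All latent spaces have 50 dimensions."
HIDDEN_UNITS = 300            # two hidden layers of 300 tanh units each

# Adam with the hyperparameters stated in Appendix G.
optimizer = optax.adam(learning_rate=3e-4, b1=0.9, b2=0.999, eps=1e-7)


def init_mlp(key, in_dim, out_dim):
    """Two-hidden-layer tanh MLP parameters (hypothetical 1/sqrt(fan-in) init)."""
    sizes = [in_dim, HIDDEN_UNITS, HIDDEN_UNITS, out_dim]
    keys = jax.random.split(key, len(sizes) - 1)
    return [(jax.random.normal(k, (m, n)) / jnp.sqrt(m), jnp.zeros(n))
            for k, m, n in zip(keys, sizes[:-1], sizes[1:])]


def mlp_apply(params, x):
    """Apply the MLP: tanh on hidden layers, linear output."""
    for W, b in params[:-1]:
        x = jnp.tanh(x @ W + b)
    W, b = params[-1]
    return x @ W + b
```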