Deep Attentive Variational Inference

Authors: Ifigeneia Apostolopoulou, Ian Char, Elan Rosenfeld, Artur Dubrawski

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct three series of experiments. In the first set of experiments (Section 3.1.1), we apply the attentive variational path proposed in this paper on VAEs that are trained on two datasets of binary images: the dynamically binarized MNIST and OMNIGLOT. In Section 3.1.2, we investigate the effectiveness of the proposed techniques on large-scale latent spaces that are used for generating the CIFAR-10 natural images. Qualitative results are provided in Appendices E (plot of KL divergence per layer), F (visualization of attention patterns), and G (novel samples and image reconstructions). Finally, in Section 3.2 we conduct an ablation study and report the benefits of each proposed attention module separately.
Researcher Affiliation | Academia | Ifigeneia Apostolopoulou (1,2), Ian Char (1,2), Elan Rosenfeld (1) & Artur Dubrawski (1,2); (1) Machine Learning Department and (2) Auton Lab, Carnegie Mellon University
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Project code: https://github.com/ifiaposto/Deep_Attentive_VI. "To aid reproducibility of the results and methods presented in our paper, we made source code to reproduce the main results of the paper publicly available, including detailed instructions; see our github page: https://github.com/ifiaposto/Deep_Attentive_VI."
Open Datasets | Yes | We evaluate the models on two benchmark datasets: MNIST (LeCun et al., 1998), a dataset of 28×28 images of handwritten digits, and OMNIGLOT (Lake et al., 2013), an alphabet recognition dataset of 28×28 images. For convenience, we add two zero pixels to each border of the training images. In both cases, the observations are dynamically binarized by being resampled from the normalized real values using a Bernoulli distribution after each epoch, as suggested by Burda et al. (2016), which prevents over-fitting. We use the standard splits of MNIST into 60,000 training and 10,000 test examples, and of OMNIGLOT into 24,345 training and 8,070 test examples. CIFAR-10 is a dataset of 32×32 natural images. (A minimal sketch of this dynamic binarization appears after the table.)
Dataset Splits | No | The paper mentions training and test splits for MNIST and OMNIGLOT, but does not explicitly provide details about a validation split for any dataset. It states: "We use the standard splits of MNIST into 60,000 training and 10,000 test examples, and of OMNIGLOT into 24,345 training and 8,070 test examples."
Hardware Specification | Yes | All models are trained on 32GB V100 GPUs.
Software Dependencies | No | The paper mentions methods and algorithms such as Layer Norm, GELU, and Adam, but does not specify any software libraries or frameworks with version numbers (e.g., PyTorch, TensorFlow, or CUDA versions).
Experiment Setup | Yes | For both datasets, we use a hierarchy of L = 15 variational layers. We use a Bernoulli distribution in the image decoder. For CIFAR-10, we use a hierarchy of L = 16 variational layers. We use a mixture of discretized Logistic distributions (Salimans et al., 2017) for the data distribution. We note that it helps optimization if we bound the log of the prior standard deviation such that log σ_p ≤ 1.0, yielding less confident prior assumptions. We also empirically find that adding Gaussian noise with σ_noise = 0.001 to both the log of the prior scale log σ_p and the log of the posterior scale log σ_q helps the network's generalization. All models are trained on 32GB V100 GPUs. The batch size per GPU is 32. (A sketch of the log-scale bounding and noise trick appears after the table.)
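The dynamic binarization quoted under Open Datasets can be summarized with a minimal NumPy sketch. It illustrates the standard procedure described in the paper (pad the 28×28 images with two zero pixels per border and resample each pixel from a Bernoulli distribution at every epoch); the array shapes, stand-in data, and function name are assumptions for illustration, not code from the authors' repository.

```python
import numpy as np

# Stand-in for normalized MNIST training images: floats in [0, 1], shape (N, 28, 28, 1).
rng = np.random.default_rng(0)
train_images = rng.uniform(size=(1000, 28, 28, 1)).astype(np.float32)

# Add two zero pixels to each border, giving 32x32 inputs as described in the quote.
train_images = np.pad(train_images, ((0, 0), (2, 2), (2, 2), (0, 0)))

def dynamically_binarize(images, rng):
    """Draw each pixel from a Bernoulli whose mean is the normalized intensity."""
    return (rng.uniform(size=images.shape) < images).astype(np.float32)

# Resample a fresh binarization of the training set at the start of every epoch.
for epoch in range(3):
    binary_epoch = dynamically_binarize(train_images, rng)
```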
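The prior/posterior scale tricks quoted under Experiment Setup can likewise be sketched in a few lines. This is a hedged illustration assuming a hard upper bound on log σ_p and i.i.d. Gaussian jitter with standard deviation 0.001 on both log scales; the function names (`bound_prior_log_scale`, `jitter_log_scale`) are hypothetical and not taken from the released code.

```python
import numpy as np

rng = np.random.default_rng(0)

def bound_prior_log_scale(log_sigma_p, upper_bound=1.0):
    """Bound the prior log standard deviation from above (log sigma_p <= 1.0),
    yielding less confident prior assumptions."""
    return np.minimum(log_sigma_p, upper_bound)

def jitter_log_scale(log_sigma, noise_std=0.001):
    """Add small Gaussian noise to a log scale; per the quoted setup, this is
    applied to both log sigma_p and log sigma_q during training."""
    return log_sigma + rng.normal(0.0, noise_std, np.shape(log_sigma))

# Hypothetical usage inside a variational layer's forward pass:
log_sigma_p = jitter_log_scale(bound_prior_log_scale(np.array([-0.3, 2.1])))
log_sigma_q = jitter_log_scale(np.array([-0.5, 0.4]))
```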