Few-shot Autoregressive Density Estimation: Towards Learning to Learn Distributions

Authors: Scott Reed, Yutian Chen, Thomas Paine, Aäron van den Oord, S. M. Ali Eslami, Danilo Rezende, Oriol Vinyals, Nando de Freitas

ICLR 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this paper, we show how 1) neural attention and 2) meta learning techniques can be used in combination with autoregressive models to enable effective few-shot density estimation. Our proposed modifications to PixelCNN result in state-of-the-art few-shot density estimation on the Omniglot dataset. Furthermore, we visualize the learned attention policy and find that it learns intuitive algorithms for simple tasks such as image mirroring on ImageNet and handwriting on Omniglot without supervision. Finally, we extend the model to natural images and demonstrate few-shot image generation on the Stanford Online Products dataset."
Researcher Affiliation | Industry | "S. Reed, Y. Chen, T. Paine, A. van den Oord, S. M. A. Eslami, D. Rezende, O. Vinyals, N. de Freitas; {reedscot,yutianc,tpaine}@google.com"
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described, nor does it include specific repository links or explicit code release statements.
Open Datasets | Yes | "We trained the model on ImageNet (Deng et al., 2009) images resized to 48×48 for 30K steps using RMSProp with learning rate 1e-4." "We trained the model on 26×26 binarized images and a 45-5 split into training and testing character alphabets as in Bornschein et al. (2017)." "In this section we demonstrate results on natural images from online product listings in the Stanford Online Products Dataset (Song et al., 2016)." (A data-preparation sketch follows the table.)
Dataset Splits | Yes | "The baseline achieves 2.64 nats/dim on the training set and 2.65 on the validation set. The attention model achieves 0.89 and 0.90 nats/dim, respectively." "We trained the model on 26×26 binarized images and a 45-5 split into training and testing character alphabets as in Bornschein et al. (2017)."
Hardware Specification | Yes | "PixelCNN and Attention PixelCNN models are also fast to train: 10K iterations with batch size 32 took under an hour using NVIDIA Tesla K80 GPUs."
Software Dependencies | No | The paper does not provide specific ancillary software details, such as library or solver names with version numbers.
Experiment Setup | Yes | "We trained the model on ImageNet (Deng et al., 2009) images resized to 48×48 for 30K steps using RMSProp with learning rate 1e-4. The network was a 16-layer PixelCNN with 128-dimensional feature maps at each layer, with skip connections to a 256-dimensional penultimate layer before pixel prediction." "It had a total of 12 layers with 24 planes each, with skip connections to a penultimate layer with 32 planes." "In practice, we used α = 0.1, and the encoder had three layers of stride-2 convolutions with 3×3 kernels, followed by elementwise squaring and a sum over all dimensions." "We used three scales: 8×8, 16×16 and 32×32. The base scale uses the standard PixelCNN architecture with 12 layers and 128 planes per layer, with 512 planes in the penultimate layer. The upscaling networks use 18 layers with 128 planes each." (A training-setup sketch follows the table.)
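
The Open Datasets and Dataset Splits rows quote the Omniglot preparation (26×26 binarized images, a 45-5 split at the alphabet level following Bornschein et al., 2017), but the paper releases no preprocessing code. The following is a minimal Python sketch of that preparation; the file index, the binarization threshold, the stroke-polarity inversion, and the shuffling seed are assumptions, not details from the paper.

```python
import numpy as np
from PIL import Image

def preprocess(path, size=26, threshold=0.5):
    """Resize one grayscale Omniglot character to 26x26 and binarize it."""
    img = Image.open(path).convert("L").resize((size, size))
    arr = np.asarray(img, dtype=np.float32) / 255.0
    # Omniglot strokes are dark on a light background; invert so strokes are 1.
    # The threshold and the inversion are assumptions, not paper details.
    return ((1.0 - arr) > threshold).astype(np.float32)

def alphabet_split(index, n_train_alphabets=45, seed=0):
    """Split at the alphabet level: 45 alphabets for training, 5 for testing.

    `index` is a hypothetical list of (image_path, alphabet_name) pairs;
    building it from the raw Omniglot download is left to the reader.
    """
    alphabets = sorted({alphabet for _, alphabet in index})
    rng = np.random.RandomState(seed)  # seed choice is an assumption
    rng.shuffle(alphabets)
    train_alphabets = set(alphabets[:n_train_alphabets])
    train = [entry for entry in index if entry[1] in train_alphabets]
    test = [entry for entry in index if entry[1] not in train_alphabets]
    return train, test
```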
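
The Experiment Setup row quotes the optimizer and the inner-loop encoder, but no reference implementation is public. Below is a minimal PyTorch sketch (PyTorch is an assumption; the paper does not name a framework) of the pieces that are stated explicitly: an encoder with three stride-2 3×3 convolutions whose output is squared elementwise and summed to a scalar inner loss, a single inner gradient step with α = 0.1, and RMSProp with learning rate 1e-4 for the outer update. The channel widths, the way the conditioning context enters the encoder, and the choice to adapt a per-episode context vector rather than model weights are assumptions; `InnerLossEncoder` and `adapt` are illustrative names, and the conditional PixelCNN itself is not reproduced here.

```python
import torch
import torch.nn as nn

class InnerLossEncoder(nn.Module):
    """Three stride-2 3x3 convolutions; the output is squared elementwise
    and summed over all dimensions to give a scalar inner loss."""
    def __init__(self, in_channels, width=32):  # width is an assumption
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, width, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(width, width, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(width, width, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, x):
        h = self.net(x)
        return (h ** 2).sum()

def adapt(context, support, encoder, alpha=0.1):
    """One inner gradient step of size alpha = 0.1 (value quoted from the paper).

    `context` is a per-episode conditioning vector with requires_grad=True;
    adapting a context vector instead of the model weights is an assumption.
    """
    b, _, h, w = support.shape
    cond = context.view(1, -1, 1, 1).expand(b, -1, h, w)
    inner_loss = encoder(torch.cat([support, cond], dim=1))
    grad, = torch.autograd.grad(inner_loss, context, create_graph=True)
    return context - alpha * grad

# Example wiring for 1-channel support images and a 16-dimensional context:
# encoder = InnerLossEncoder(in_channels=1 + 16)
# context = torch.zeros(16, requires_grad=True)
# adapted = adapt(context, support_batch, encoder)
# The outer loop would then maximize the PixelCNN likelihood conditioned on
# `adapted`, using the quoted optimizer setting:
# optimizer = torch.optim.RMSprop(all_parameters, lr=1e-4)
```

In the paper the inner objective is learned jointly with the density model; the sketch above only shows the quoted shape of that objective and the size of the inner step, not the full training loop.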