Pixel Recurrent Neural Networks

Authors: Aäron van den Oord, Nal Kalchbrenner, Koray Kavukcuoglu

ICML 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We achieve log-likelihood scores on natural images that are considerably better than the previous state of the art. Our main results also provide benchmarks on the diverse ImageNet dataset. Samples generated from the model appear crisp, varied and globally coherent. Next we test the models on MNIST and on CIFAR-10 and show that they obtain log-likelihood scores that are considerably better than previous results.
Researcher Affiliation | Industry | Aäron van den Oord (AVDNOORD@GOOGLE.COM), Nal Kalchbrenner (NALK@GOOGLE.COM), Koray Kavukcuoglu (KORAYK@GOOGLE.COM), Google DeepMind.
Pseudocode | No | The paper includes mathematical equations and architectural diagrams, but it does not contain any structured pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not contain any statement about releasing source code or provide a link to a code repository for the described methodology.
Open Datasets | Yes | Although the goal of our work was to model natural images on a large scale, we also tried our model on the binary version (Salakhutdinov & Murray, 2008) of MNIST (LeCun et al., 1998) as it is a good sanity check and there is a lot of previous art on this dataset to compare with. Next we test our models on the CIFAR-10 dataset (Krizhevsky, 2009). Although to our knowledge there are no published results on the ILSVRC ImageNet dataset (Russakovsky et al., 2015) that we can compare our models with, we give our ImageNet log-likelihood performance in Table 6.
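For context on the MNIST evaluation quoted above: the Salakhutdinov & Murray (2008) benchmark treats each grayscale intensity as a Bernoulli probability and samples a binary image from it. The sketch below illustrates that convention; the function name and the use of NumPy are our own assumptions, and the published benchmark is one fixed, pre-sampled binarization rather than a fresh draw per run.

```python
import numpy as np

def binarize_mnist(images: np.ndarray, seed: int = 0) -> np.ndarray:
    """Illustrative (hypothetical) binarization: treat each grayscale
    intensity in [0, 1] as the probability that a pixel is on, then
    sample once. Salakhutdinov & Murray (2008) published one such fixed
    sample, which is the split later papers compare against."""
    rng = np.random.default_rng(seed)
    return (rng.random(images.shape) < images).astype(np.uint8)
```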
Dataset Splits | Yes | For the Row LSTM model with a softmax output distribution we obtain 3.06 bits/dim on the CIFAR-10 validation set. When using both the residual and skip connections, we see in Table 3 that performance of the Row LSTM improves with increased depth. This holds for up to the 12 LSTM layers that we tried. ImageNet NLL by image size, validation (train): 32×32: 3.86 (3.83); 64×64: 3.63 (3.57).
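The CIFAR-10 and ImageNet numbers quoted above are negative log-likelihoods in bits per dimension, i.e. the NLL in nats divided by ln 2 and by the number of sub-pixels. A minimal sketch of that standard conversion follows; the 6,515-nat figure is back-computed here purely to illustrate the arithmetic and does not appear in the paper.

```python
import math

def bits_per_dim(nll_nats: float, num_dims: int) -> float:
    """Convert a per-image negative log-likelihood in nats to bits per
    dimension: divide by ln(2) to get bits, then by the dimension count."""
    return nll_nats / (math.log(2.0) * num_dims)

# A 32x32 RGB CIFAR-10 image has 32 * 32 * 3 = 3072 dimensions, so an
# average per-image NLL of roughly 6515 nats corresponds to the 3.06
# bits/dim quoted for the Row LSTM (illustrative back-calculation only).
print(round(bits_per_dim(6515.0, 32 * 32 * 3), 2))  # 3.06
```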
Hardware Specification | No | The paper states: "Our models are trained on GPUs using the Torch toolbox." and "For ImageNet we use as large a batch size as allowed by the GPU memory". However, it does not specify any particular GPU models (e.g., NVIDIA A100), CPU types, or other hardware details that would allow for replication of the exact experimental environment.
Software Dependencies | No | The paper mentions using "the Torch toolbox" for training models but does not provide specific version numbers for Torch or any other software dependencies required to replicate the experiments.
Experiment Setup | Yes | The learning rate schedules were manually set for every dataset to the highest values that allowed fast convergence. The batch sizes also vary for different datasets. For smaller datasets such as MNIST and CIFAR-10 we use smaller batch sizes of 16 images as this seems to regularize the models. For ImageNet we use as large a batch size as allowed by the GPU memory; this corresponds to 64 images/batch for 32×32 ImageNet, and 32 images/batch for 64×64 ImageNet. Apart from scaling and centering the images at the input of the network, we don't use any other preprocessing or augmentation. For the multinomial loss function we use the raw pixel color values as categories. For all the PixelRNN models, we learn the initial recurrent state of the network. For MNIST we use a Diagonal BiLSTM with 7 layers and a value of h = 16. For CIFAR-10 the Row and Diagonal BiLSTMs have 12 layers and a number of h = 128 units. The PixelCNN has 15 layers and h = 128. For 32×32 ImageNet we adopt a 12 layer Row LSTM with h = 384 units and for 64×64 ImageNet we use a 4 layer Row LSTM with h = 512 units.
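To make the quoted setup easier to scan, the sketch below consolidates the per-dataset settings into one structure and shows the 256-way multinomial loss over raw pixel values that the quote describes. All identifiers are hypothetical, and learning-rate schedules are omitted because the paper only says they were hand-tuned per dataset.

```python
import numpy as np

# Hypothetical consolidation of the settings quoted above; key names are ours.
CONFIGS = {
    "mnist":       {"model": "Diagonal BiLSTM",     "layers": 7,  "h": 16,  "batch": 16},
    "cifar10_rnn": {"model": "Row/Diagonal BiLSTM", "layers": 12, "h": 128, "batch": 16},
    "cifar10_cnn": {"model": "PixelCNN",            "layers": 15, "h": 128, "batch": 16},
    "imagenet32":  {"model": "Row LSTM",            "layers": 12, "h": 384, "batch": 64},
    "imagenet64":  {"model": "Row LSTM",            "layers": 4,  "h": 512, "batch": 32},
}

def multinomial_nll(logits: np.ndarray, pixels: np.ndarray) -> float:
    """256-way softmax negative log-likelihood with raw pixel values
    (0..255) as categories, as the quote describes.
    logits: (N, 256) unnormalized scores; pixels: (N,) integer targets."""
    z = logits - logits.max(axis=1, keepdims=True)            # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(pixels)), pixels].mean())
```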