GENERATING HIGH FIDELITY IMAGES WITH SUBSCALE PIXEL NETWORKS AND MULTIDIMENSIONAL UPSCALING

Authors: Jacob Menick, Nal Kalchbrenner

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate SPNs on the unconditional generation of CelebA-HQ of size 256 and of ImageNet from size 32 to 256. We achieve state-of-the-art likelihood results in multiple settings, set up new benchmark results in previously unexplored settings and are able to generate very high fidelity large scale samples on the basis of both datasets. We extensively evaluate the performance of SPN and the size and depth upscaling methods both quantitatively and from a fidelity perspective on two unconditional image generation benchmarks, CelebA-HQ-256 and ImageNet of various sizes up to 256. From an MLE scores perspective, we compare with previous work to obtain state-of-the-art results on CelebA-HQ-256, both at full 8-bit resolution and at the reduced 5-bit resolution (Kingma & Dhariwal, 2018), and on ImageNet-64. We also establish MLE baselines for ImageNet-128 and ImageNet-256. (Abstract and Section 1)
Researcher Affiliation | Industry | Jacob Menick, DeepMind, jmenick@google.com; Nal Kalchbrenner, Google Brain Amsterdam, nalk@google.com
Pseudocode | No | The paper includes architectural diagrams and mathematical equations but does not present any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not state that its code is open source, nor does it provide any links to a code repository for its specific implementation. It only mentions using an 'open source Transformer implementation' from a cited work.
Open Datasets | Yes | We evaluate SPNs on the unconditional generation of CelebA-HQ of size 256 and of ImageNet from size 32 to 256. (Abstract) / For these experiments we use the standard ILSVRC ImageNet dataset (Kolesnikov & Lampert, 2016b) resized with TensorFlow's resize_area function. (Section 4.2) / At 256×256 we can produce high-fidelity samples of celebrity faces from the CelebA-HQ dataset. (Section 4.3) (A sketch of the area-resize preprocessing is given after this table.)
Dataset Splits | No | The paper refers to 'held-out data' and uses standard datasets, but it does not explicitly state the train/validation/test splits, percentages, or sample counts used in its experiments.
Hardware Specification | Yes | These large batch sizes (a maximum of 2048) are achieved by increasing the degree of data parallelism by running on Google Cloud TPU pods (Jouppi et al., 2017). For ImageNet-32 we used 64 tensorcores. For ImageNet-64, 128 and 256, we use 128 tensorcores. When overfitting is a problem, as in small datasets like CelebA-HQ, we rather decrease the batch size and use a lower number of 32 tensorcores. (Appendix C) (See the TPU data-parallelism sketch after this table.)
Software Dependencies | No | The paper mentions using 'Tensorflow' and refers to an 'open source Transformer implementation in Vaswani et al. (2018)', but it does not provide version numbers for these or any other software dependencies.
Experiment Setup | Yes | Because our networks operate on small images (32×32 slices), we can train large networks both in terms of the number of hidden units and in terms of network depth (see Appendix C for details of sizes). The context-embedding network contains 5 convolutional layers and 6-8 self-attention layers depending on the dataset. The masked decoder consists of a PixelCNN with 15 layers in all experiments. The 1D Transformer in the decoder (Figure 4(b)) has between 8 and 10 layers depending on the dataset. See Table 4 for all dataset-specific hyperparameter details. (Section 4) / Table 4 explicitly lists numerous hyperparameters such as batch size, learning rate, optimizer settings, and network layer details. (The quoted architecture numbers are collected into a config sketch after this table.)
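Regarding the Open Datasets row: the paper names TensorFlow's resize_area function as the ImageNet preprocessing step. Below is a minimal sketch of that resize in current TensorFlow, where the AREA method of tf.image.resize plays the same role; the dtype handling and the choice of target sizes are our assumptions, not details from the paper.

```python
# Minimal sketch of the preprocessing the paper names ("resized with
# Tensorflow's resize_area function"). In TF 2.x the equivalent is
# tf.image.resize with the AREA method; dtype handling and target
# sizes here are our assumptions.
import tensorflow as tf

def downsample_area(image, size):
    """Area-resize a uint8 HWC image to (size, size, channels)."""
    resized = tf.image.resize(
        tf.cast(image, tf.float32),
        (size, size),
        method=tf.image.ResizeMethod.AREA,
    )
    return tf.cast(tf.round(resized), tf.uint8)

# Example: derive the smaller ImageNet inputs from one full-size image.
full = tf.zeros([256, 256, 3], dtype=tf.uint8)  # placeholder image
smaller = {s: downsample_area(full, s) for s in (32, 64, 128)}
```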
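For the Hardware Specification row: a hedged sketch of the data-parallel TPU setup, assuming a tf.distribute.TPUStrategy-style split of the global batch across tensorcores. The paper reports the core counts and the maximum batch size but not the sharding mechanics, and the small Conv2D stack is a toy stand-in, not the SPN itself.

```python
# Hedged sketch of the Appendix C setup: one global batch (up to 2048)
# split across TPU tensorcores. The resolver/strategy calls are
# standard TF 2.x; running this requires an actual TPU, and the model
# is a toy stand-in for the SPN.
import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

GLOBAL_BATCH = 2048  # "a maximum of 2048" (Appendix C)
# With 128 tensorcores (ImageNet-64/128/256) this is 16 examples/core.
print("per-core batch:", GLOBAL_BATCH // strategy.num_replicas_in_sync)

with strategy.scope():
    # Variables created in this scope are replicated across all cores.
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu"),
        tf.keras.layers.Conv2D(3 * 256, 1),  # 256-way logits per RGB channel
    ])
```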
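And for the Experiment Setup row: the architecture details quoted from Section 4, gathered into one config dict for reference. The key names and the (low, high) tuple convention for dataset-dependent depths are our own; only the numbers come from the paper.

```python
# Section 4 architecture numbers collected into a single config dict.
# Key names and the (low, high) range convention are our convention.
SPN_ARCH = {
    "slice_size": (32, 32),                # networks operate on 32x32 slices
    "context_embedding": {
        "conv_layers": 5,
        "self_attention_layers": (6, 8),   # depends on the dataset
    },
    "masked_decoder": {
        "pixelcnn_layers": 15,             # fixed across all experiments
        "transformer_1d_layers": (8, 10),  # depends on the dataset
    },
}
```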