Generating Interpretable Images with Controllable Structure

Authors: Scott Reed, Aäron van den Oord, Nal Kalchbrenner, Victor Bapst, Matt Botvinick, Nando de Freitas

ICLR 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We trained our model on three image datasets annotated with text and spatial structure. We establish quantitative baselines in terms of text- and structure-conditional pixel log-likelihood for three datasets: Caltech-UCSD Birds (CUB), MPII Human Pose (MHP), and Common Objects in Context (MS-COCO). Table 1 shows quantitative results in terms of the negative log-likelihood of image pixels conditioned on both text and structure, for all three datasets. (A nats/dim conversion is sketched after this table.)
Researcher Affiliation | Industry | S. Reed, A. van den Oord, N. Kalchbrenner, V. Bapst, M. Botvinick, N. de Freitas, Google DeepMind, {reedscot,avdnoord,nalk,vbapst,botvinick,nandodefreitas}@google.com
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described.
Open Datasets | Yes | The MPII Human Pose dataset (MHP) has around 25K images of humans performing 410 different activities (Andriluka et al., 2014). The Caltech-UCSD Birds database (CUB) has 11,788 images in 200 species, with 10 captions per image (Wah et al., 2011). MS-COCO (Lin et al., 2014) contains 80K training images annotated with both 5 captions per image and segmentations.
Dataset Splits | Yes | Table 1: Text- and structure-conditional negative log-likelihoods (nll) in nats/dim. Train, validation and test splits include all of the same categories but different images and associated annotations. (A within-category split is sketched after this table.)
Hardware Specification | No | The paper does not provide specific hardware details used for running its experiments.
Software Dependencies | No | The paper mentions software components such as the PixelCNN module, a character-CNN-GRU text encoder, and the RMSprop optimizer, but does not specify version numbers for any underlying software (e.g., Python, TensorFlow, or PyTorch versions).
Experiment Setup | Yes | We trained the model on 32x32 images. The PixelCNN module used 10 layers with 128 feature maps. The text encoder reads character-level input, applying a GRU encoder and average pooling after three convolution layers. Unlike in Reed et al. (2016a), the text encoder is trained end-to-end from scratch for conditional image modeling. We used RMSprop with a learning rate schedule starting at 1e-4 and decaying to 1e-5, trained for 200k steps with a batch size of 128. We used T = 1.05 by default. (A training-setup sketch follows below.)
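
The Research Type and Dataset Splits rows both quote negative log-likelihoods reported in nats/dim. As a reference for that unit, here is a minimal sketch of the conversion from summed per-image log-likelihoods to nats/dim; the function name and interface are illustrative assumptions, not code from the paper.

```python
import numpy as np

def nll_nats_per_dim(total_log_likelihoods, image_shape=(32, 32, 3)):
    """Convert summed per-image conditional log-likelihoods (in nats)
    into a nats/dim figure like the one reported in Table 1.

    total_log_likelihoods: array of shape (num_images,); each entry is
    the sum over all pixels and color channels of
    log p(x_i | x_<i, text, structure) for one image, as produced by an
    autoregressive model such as a PixelCNN.
    """
    dims = np.prod(image_shape)  # 32 * 32 * 3 = 3072 dimensions per image
    return float(-np.mean(total_log_likelihoods) / dims)
```

If a model reports bits/dim instead, multiplying by ln 2 (about 0.693) converts the figure to nats/dim.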
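
The Dataset Splits row states that the train, validation and test splits share all categories but contain different images. One way to realize such a split is sketched below; the split fractions, seed, and function name are assumptions, since the quoted text does not specify them.

```python
import random
from collections import defaultdict

def split_within_categories(image_ids, category_of,
                            fractions=(0.8, 0.1, 0.1), seed=0):
    """Split images so that every category appears in train, val and
    test, but no image appears in more than one split."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for image_id in image_ids:
        by_category[category_of[image_id]].append(image_id)
    train, val, test = [], [], []
    for images in by_category.values():
        rng.shuffle(images)
        n_train = int(fractions[0] * len(images))
        n_val = int(fractions[1] * len(images))
        train += images[:n_train]
        val += images[n_train:n_train + n_val]
        test += images[n_train + n_val:]
    return train, val, test
```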
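
The Experiment Setup row pins down the text encoder (three convolution layers, then a GRU, then average pooling over characters) and the optimizer (RMSprop, learning rate decaying from 1e-4 to 1e-5 over 200k steps, batch size 128). The PyTorch sketch below wires those pieces together; the layer widths, character vocabulary size, and linear shape of the decay are assumptions (the paper only gives the start and end learning rates), and the 10-layer, 128-feature-map PixelCNN decoder is omitted for brevity.

```python
import torch
import torch.nn as nn

class CharCNNGRU(nn.Module):
    """Character-level text encoder along the lines quoted above:
    three convolution layers, then a GRU, then average pooling."""

    def __init__(self, vocab_size=70, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.Sequential(
            nn.Conv1d(embed_dim, embed_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(embed_dim, embed_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(embed_dim, embed_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, chars):  # chars: (batch, seq_len) integer tensor
        x = self.embed(chars)                              # (B, T, E)
        x = self.convs(x.transpose(1, 2)).transpose(1, 2)  # (B, T, E)
        out, _ = self.gru(x)                               # (B, T, H)
        return out.mean(dim=1)  # average pooling over the sequence

encoder = CharCNNGRU()
optimizer = torch.optim.RMSprop(encoder.parameters(), lr=1e-4)
# Decay the learning rate from 1e-4 to 1e-5 over 200k steps; the linear
# shape is an assumption, the endpoints come from the quoted setup.
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1.0, end_factor=0.1, total_iters=200_000)
```

In training, `scheduler.step()` would be called once per batch of 128 examples for the full 200k steps.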