Generating Interpretable Images with Controllable Structure
Authors: Scott Reed, Aäron van den Oord, Nal Kalchbrenner, Victor Bapst, Matt Botvinick, Nando de Freitas
ICLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We trained our model on three image data sets annotated with text and spatial structure. We establish quantitative baselines in terms of text- and structure-conditional pixel log-likelihood for three data sets: Caltech-UCSD Birds (CUB), MPII Human Pose (MHP), and Common Objects in Context (MS-COCO). Table 1 shows quantitative results in terms of the negative log-likelihood of image pixels conditioned on both text and structure, for all three datasets. |
| Researcher Affiliation | Industry | S. Reed, A. van den Oord, N. Kalchbrenner, V. Bapst, M. Botvinick, N. de Freitas, Google DeepMind, {reedscot,avdnoord,nalk,vbapst,botvinick,nandodefreitas}@google.com |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | Yes | The MPII Human Pose dataset (MHP) has around 25K images of humans performing 410 different activities (Andriluka et al., 2014). The Caltech-UCSD Birds database (CUB) has 11,788 images in 200 species, with 10 captions per image (Wah et al., 2011). MS-COCO (Lin et al., 2014) contains 80K training images annotated with both 5 captions per image and segmentations. |
| Dataset Splits | Yes | Table 1: Text- and structure-conditional negative log-likelihoods (nll) in nats/dim. Train, validation and test splits include all of the same categories but different images and associated annotations. (A short sketch of the nats/dim computation follows the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details used for running its experiments. |
| Software Dependencies | No | The paper mentions model components and training methods such as the PixelCNN architecture, a character-CNN-GRU text encoder, and the RMSprop optimizer, but does not specify software dependencies or version numbers (e.g., Python, TensorFlow, PyTorch versions). |
| Experiment Setup | Yes | We trained the model on 32x32 images. The PixelCNN module used 10 layers with 128 feature maps. The text encoder reads character-level input, applying a GRU encoder and average pooling after three convolution layers. Unlike in Reed et al. (2016a), the text encoder is trained end-to-end from scratch for conditional image modeling. We used RMSprop with a learning rate schedule starting at 1e-4 and decaying to 1e-5, trained for 200k steps with a batch size of 128. We used T = 1.05 by default. (A hedged sketch of this configuration follows the table.) |
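For context on the nats/dim unit in the rows above: the Table 1 numbers are the total negative log-likelihood in nats divided by the number of pixel dimensions of a 32x32 RGB image (32 × 32 × 3 = 3072). The helper below is a minimal sketch of that conversion, not code from the paper; the function name and the input format (one log-probability per pixel dimension) are assumptions.

```python
import numpy as np

def nll_nats_per_dim(log_probs, image_shape=(32, 32, 3)):
    """Convert per-pixel log-probabilities (in nats) to nats/dim.

    log_probs: one log p(x_i | x_<i, text, structure) per pixel
    dimension (hypothetical input format, not the paper's API).
    """
    num_dims = np.prod(image_shape)   # 32 * 32 * 3 = 3072 dimensions
    total_nll = -np.sum(log_probs)    # total NLL in nats
    return total_nll / num_dims       # per-dimension NLL, as in Table 1

# Sanity check: a uniform model over 256 pixel values scores
# -log(1/256) ~= 5.545 nats/dim, a natural worst-case reference.
uniform = np.full(32 * 32 * 3, np.log(1.0 / 256.0))
print(nll_nats_per_dim(uniform))  # ~5.545
```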
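The Experiment Setup row pins down most of the reported hyperparameters. The sketch below simply writes that configuration down in code; the `TrainConfig` and `learning_rate` names and the linear decay are our assumptions, since the paper states only the start and end learning rates, the step count, the batch size, and the default temperature.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # Values quoted from the paper's experiment setup.
    image_size: int = 32            # 32x32 training images
    pixelcnn_layers: int = 10       # PixelCNN depth
    feature_maps: int = 128         # feature maps per layer
    batch_size: int = 128
    train_steps: int = 200_000
    lr_start: float = 1e-4
    lr_end: float = 1e-5
    temperature: float = 1.05       # default sampling temperature T

def learning_rate(step: int, cfg: TrainConfig) -> float:
    """Interpolate from lr_start to lr_end over training.

    The paper says only that the RMSprop learning rate starts at 1e-4
    and decays to 1e-5; linear decay here is an assumption, not the
    authors' schedule.
    """
    frac = min(step / cfg.train_steps, 1.0)
    return cfg.lr_start + frac * (cfg.lr_end - cfg.lr_start)

cfg = TrainConfig()
print(learning_rate(0, cfg), learning_rate(cfg.train_steps, cfg))  # 1e-4 1e-5
```

At sampling time, T = 1.05 would typically be applied by rescaling the model's 256-way softmax logits, but the paper does not spell out the convention, so the temperature is kept as a plain config value here.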