Generative Adversarial Text to Image Synthesis

Authors: Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, Honglak Lee

ICML 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the capability of our model to generate plausible images of birds and flowers from detailed text descriptions. We compare the GAN baseline, our GAN-CLS with image-text matching discriminator (subsection 4.2), GAN-INT learned with text manifold interpolation (subsection 4.3), and GAN-INT-CLS, which combines both. Results on CUB can be seen in Figure 3, and results on the Oxford-102 Flowers dataset in Figure 4. To quantify the degree of disentangling on CUB, we set up two prediction tasks with noise z as the input: pose verification and background color verification. We present results in Figure 5.
Researcher Affiliation | Academia | 1) University of Michigan, Ann Arbor, MI, USA (UMICH.EDU); 2) Max Planck Institute for Informatics, Saarbrücken, Germany (MPI-INF.MPG.DE)
Pseudocode | Yes | Algorithm 1: GAN-CLS training algorithm with step size α, using minibatch SGD for simplicity. (A hedged training-step sketch follows the table.)
Open Source Code | No | Our implementation was built on top of dcgan.torch (https://github.com/soumith/dcgan.torch). The provided link is to a general DCGAN framework, not specific code for the methodology or modifications presented in this paper.
Open Datasets | Yes | We mainly use the Caltech-UCSD Birds (CUB) dataset and the Oxford-102 Flowers dataset, along with five text descriptions per image that we collected, as our evaluation setting. CUB has 11,788 images of birds belonging to one of 200 different categories. The Oxford-102 contains 8,189 images of flowers from 102 different categories.
Dataset Splits | Yes | As in Akata et al. (2015) and Reed et al. (2016), we split these into class-disjoint training and test sets. CUB has 150 train+val classes and 50 test classes, while Oxford-102 has 82 train+val and 20 test classes. (A class-disjoint split sketch follows the table.)
Hardware Specification | No | The paper does not specify the hardware used (e.g., GPU models, CPU types, or memory specifications) for running the experiments.
Software Dependencies | No | Our implementation was built on top of dcgan.torch. This names a software framework but does not provide specific version numbers for it or for other key dependencies (e.g., Torch, CUDA, or cuDNN versions).
Experiment Setup | Yes | The training image size was set to 64 × 64 × 3. The text encoder produced 1,024-dimensional embeddings that were projected to 128 dimensions in both the generator and discriminator. We used the same base learning rate of 0.0002, and used the ADAM solver (Ba & Kingma, 2015) with momentum 0.5. The generator noise was sampled from a 100-dimensional unit normal distribution. We used a minibatch size of 64 and trained for 600 epochs. (A configuration sketch follows the table.)
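
To make the Pseudocode row concrete: Algorithm 1 scores three discriminator pairings per step, namely {real image, matching text}, {real image, mismatching text}, and {fake image, matching text}. Below is a minimal PyTorch-style sketch of one such GAN-CLS step, offered as an illustration rather than the authors' Torch implementation; the module names (G, D, text_encoder), the optimizer-based updates, and the use of binary cross-entropy in place of the paper's explicit log terms are assumptions.

```python
import torch
import torch.nn.functional as F

def gan_cls_step(G, D, text_encoder, opt_G, opt_D,
                 real_images, matching_text, mismatching_text, z_dim=100):
    """One GAN-CLS update, following the structure of Algorithm 1.

    Assumes G(z, h) returns an image and D(image, h) returns a probability
    that the pair is (real image, matching text). Illustrative only.
    """
    batch_size = real_images.size(0)

    # Encode matching and mismatching text descriptions.
    h = text_encoder(matching_text)         # phi(t)
    h_hat = text_encoder(mismatching_text)  # phi(t_hat)

    # Draw noise and generate a fake image conditioned on the matching text.
    z = torch.randn(batch_size, z_dim, device=real_images.device)
    fake_images = G(z, h)

    # Discriminator scores for the three GAN-CLS pairings.
    s_r = D(real_images, h)           # real image, right text
    s_w = D(real_images, h_hat)       # real image, wrong text
    s_f = D(fake_images.detach(), h)  # fake image, right text

    ones, zeros = torch.ones_like(s_r), torch.zeros_like(s_r)

    # Discriminator loss: the two "wrong" pairings are averaged, mirroring
    # the paper's (log(1 - s_w) + log(1 - s_f)) / 2 term.
    loss_D = F.binary_cross_entropy(s_r, ones) + \
             0.5 * (F.binary_cross_entropy(s_w, zeros) +
                    F.binary_cross_entropy(s_f, zeros))
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # Generator loss: fool the discriminator on the (fake image, right text) pair.
    loss_G = F.binary_cross_entropy(D(fake_images, h), ones)
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()

    return loss_D.item(), loss_G.item()
```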
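
The Dataset Splits row describes class-disjoint splits: no bird or flower category appears in both training and test. The actual splits follow Akata et al. (2015) rather than a fresh random draw, so the sketch below only illustrates the class-disjoint property with the reported class counts (CUB: 150 train+val / 50 test; Oxford-102: 82 / 20); the helper name and seed are assumptions.

```python
import random

def class_disjoint_split(class_ids, num_train_val, seed=0):
    """Split category labels (not individual images) into disjoint
    train+val and test sets. Illustrative only; the paper reuses the
    fixed splits of Akata et al. (2015)."""
    rng = random.Random(seed)
    shuffled = list(class_ids)
    rng.shuffle(shuffled)
    train_val = set(shuffled[:num_train_val])
    test = set(shuffled[num_train_val:])
    return train_val, test

# CUB: 200 bird categories -> 150 train+val classes, 50 test classes.
cub_train_val, cub_test = class_disjoint_split(range(1, 201), num_train_val=150)
# Oxford-102: 102 flower categories -> 82 train+val classes, 20 test classes.
flowers_train_val, flowers_test = class_disjoint_split(range(1, 103), num_train_val=82)
assert cub_train_val.isdisjoint(cub_test) and flowers_train_val.isdisjoint(flowers_test)
```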
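
The Experiment Setup row maps directly onto a DCGAN-style training configuration. The sketch below wires the reported values into Adam optimizers; the placeholder linear modules, the placement of the text-projection layer, and the second Adam beta (0.999) are assumptions, since the paper states only the learning rate and the 0.5 momentum term.

```python
import torch
import torch.nn as nn

# Values reported in the paper's experiment setup.
IMAGE_SIZE = 64            # training images are 64 x 64 x 3
TEXT_EMBED_DIM = 1024      # text encoder embedding size
PROJECTED_TEXT_DIM = 128   # text projection used in both G and D
Z_DIM = 100                # generator noise dimension
BATCH_SIZE = 64
EPOCHS = 600
LEARNING_RATE = 2e-4
ADAM_BETA1 = 0.5           # "momentum 0.5" for the ADAM solver

# Placeholder modules stand in for the real DCGAN-style G and D;
# beta2 = 0.999 is an assumed default, not stated in the paper.
G = nn.Linear(Z_DIM + PROJECTED_TEXT_DIM, IMAGE_SIZE * IMAGE_SIZE * 3)
D = nn.Linear(IMAGE_SIZE * IMAGE_SIZE * 3 + PROJECTED_TEXT_DIM, 1)
text_projection = nn.Linear(TEXT_EMBED_DIM, PROJECTED_TEXT_DIM)

opt_G = torch.optim.Adam(G.parameters(), lr=LEARNING_RATE, betas=(ADAM_BETA1, 0.999))
opt_D = torch.optim.Adam(
    list(D.parameters()) + list(text_projection.parameters()),
    lr=LEARNING_RATE, betas=(ADAM_BETA1, 0.999),
)
noise = torch.randn(BATCH_SIZE, Z_DIM)  # 100-dimensional unit normal noise per sample
```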