StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis
Authors: Axel Sauer, Tero Karras, Samuli Laine, Andreas Geiger, Timo Aila
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our proposed model, StyleGAN-T, addresses the specific requirements of large-scale text-to-image synthesis, such as large capacity, stable training on diverse datasets, strong text alignment, and controllable variation vs. text alignment tradeoff. StyleGAN-T significantly improves over previous GANs and outperforms distilled diffusion models, the previous state-of-the-art in fast text-to-image synthesis, in terms of sample quality and speed. We use zero-shot MS COCO to compare the performance of our model to the state-of-the-art quantitatively at 64 × 64 pixel output resolution in Table 2 and 256 × 256 in Table 3. (A minimal sketch of zero-shot FID evaluation follows this table.) |
| Researcher Affiliation | Collaboration | University of Tübingen, Tübingen AI Center; NVIDIA. |
| Pseudocode | No | The paper includes architectural diagrams (Figure 3) but does not provide any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our implementation is available at https://github.com/autonomousvision/stylegan-t. |
| Open Datasets | Yes | We train on a union of several datasets: CC12m (Changpinyo et al., 2021), CC (Sharma et al., 2018), YFCC100m (filtered) (Thomee et al., 2016; Singh et al., 2022), Redcaps (Desai et al., 2021), LAION-aesthetic-6+ (Schuhmann et al., 2022). This amounts to a total of 250M text-image pairs. |
| Dataset Splits | No | The paper states that it trains on a union of several datasets and evaluates using zero-shot MS COCO. However, it does not provide specific details on how its combined training data was split into training, validation, and test sets (e.g., percentages, sample counts, or specific file names for custom splits) for its own model development and hyperparameter tuning. While MS COCO is used for zero-shot evaluation, the internal training/validation split methodology is not detailed. |
| Hardware Specification | Yes | We have a fixed training budget of 4 weeks on 64 NVIDIA A100s available for training our final model at scale. StyleGAN-T greatly narrows the quality gap between GANs and other model families while generating samples at a rate of 10 FPS on an NVIDIA A100. Generating these 56 samples at 512 × 512 takes 6 seconds on an NVIDIA RTX 3090. (A throughput-measurement sketch follows this table.) |
| Software Dependencies | No | The paper mentions software components like "Adam Optimizer" and "CLIP guidance" but does not specify version numbers for any programming languages, libraries (e.g., PyTorch, TensorFlow), or other software tools used in the experiments. |
| Experiment Setup | Yes | Table 4 lists the training and network architecture hyperparameters for our two configurations: lightweight (used for ablations) and the full configuration (used for main results). This includes details such as Generator channel base/max, Number of residual blocks, Generator/Text encoder parameters, Latent (z) dimension, Discriminator features, Dataset size, Number of GPUs, Batch size, Optimizer, Learning rates, Adam betas, EMA, CLIP guidance weight, and Progressive growing. Table 5 details the training schedules including phases and A100 days. (An illustrative configuration sketch follows this table.) |
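
The zero-shot MS COCO comparison cited in the Research Type row refers to computing FID between caption-conditioned samples and real COCO images for a model that never trained on COCO. Below is a minimal sketch of that kind of evaluation using torchmetrics' `FrechetInceptionDistance`; it is not the paper's evaluation code, and the small sample counts and 64-feature setting are demo-only choices (standard FID, as reported in the paper's tables, uses 2048-dimensional Inception features over tens of thousands of samples).

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance


def zero_shot_fid(real_images: torch.Tensor, fake_images: torch.Tensor,
                  feature: int = 2048) -> float:
    """FID between real and generated images.

    Both inputs are assumed to be uint8 tensors of shape [N, 3, H, W];
    in a zero-shot COCO protocol the fakes are generated from COCO
    validation captions by a model not trained on COCO.
    """
    fid = FrechetInceptionDistance(feature=feature)
    # update() expects uint8 images in [0, 255] when normalize=False (default).
    fid.update(real_images, real=True)
    fid.update(fake_images, real=False)
    return fid.compute().item()


if __name__ == "__main__":
    # Random stand-ins just to exercise the metric; the result is meaningless.
    real = torch.randint(0, 256, (256, 3, 64, 64), dtype=torch.uint8)
    fake = torch.randint(0, 256, (256, 3, 64, 64), dtype=torch.uint8)
    print(f"FID (random data): {zero_shot_fid(real, fake, feature=64):.2f}")
```

In an actual reproduction, `fake_images` would be produced by the released StyleGAN-T checkpoint conditioned on MS COCO captions and resized to the target resolution before the metric update.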
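
The 10 FPS figure in the Hardware Specification row is a sampling-throughput number. The sketch below shows one plausible way to measure images per second on a GPU; the generator here is a placeholder module standing in for the released StyleGAN-T checkpoint, it omits text conditioning, and the batch size, latent dimension, and resolution are illustrative assumptions.

```python
import time
import torch


@torch.no_grad()
def measure_fps(generator: torch.nn.Module, z_dim: int, batch_size: int = 8,
                iters: int = 50, device: str = "cuda") -> float:
    """Measure images/second for a latent-to-image generator on one GPU."""
    generator = generator.to(device).eval()
    z = torch.randn(batch_size, z_dim, device=device)

    # Warm-up so kernel compilation and memory allocation don't skew timing.
    for _ in range(5):
        generator(z)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        generator(z)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return batch_size * iters / elapsed


if __name__ == "__main__":
    # Placeholder generator: a linear map reshaped into a 512x512 RGB image.
    # A real measurement would load the released StyleGAN-T weights instead.
    z_dim, res = 64, 512
    dummy = torch.nn.Sequential(
        torch.nn.Linear(z_dim, 3 * res * res),
        torch.nn.Unflatten(1, (3, res, res)),
    )
    print(f"{measure_fps(dummy, z_dim):.1f} images/s (dummy model)")
```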
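
The hyperparameters listed in the Experiment Setup row map naturally onto a flat configuration object. The dataclass below mirrors the categories named in Table 4; every default value is an illustrative placeholder, not a setting reported in the paper, which should be taken from Table 4 or the released code.

```python
from dataclasses import dataclass


@dataclass
class StyleGANTConfig:
    """Illustrative config mirroring Table 4's hyperparameter categories.

    All defaults are placeholders, not the values reported in the paper.
    """
    # Generator / discriminator architecture
    channel_base: int = 32768          # generator channel base
    channel_max: int = 512             # generator channel max
    num_res_blocks: int = 2            # residual blocks per resolution
    z_dim: int = 64                    # latent (z) dimension
    # Training setup
    num_gpus: int = 8
    batch_size: int = 256
    optimizer: str = "adam"
    g_lr: float = 2e-3                 # generator learning rate
    d_lr: float = 2e-3                 # discriminator learning rate
    adam_betas: tuple = (0.0, 0.99)
    ema_kimg: float = 10.0             # EMA half-life, in thousands of images
    clip_guidance_weight: float = 0.2
    progressive_growing: bool = True


if __name__ == "__main__":
    # Two named configurations, echoing the paper's lightweight vs. full split.
    lightweight = StyleGANTConfig()
    full = StyleGANTConfig(num_gpus=64, batch_size=2048)
    print(lightweight)
    print(full)
```

Separating the lightweight (ablation) and full (main-result) configurations into two instances of one dataclass keeps the comparison between them explicit, which is how Table 4 presents them.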