StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis
Authors: Axel Sauer, Tero Karras, Samuli Laine, Andreas Geiger, Timo Aila
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our proposed model, StyleGAN-T, addresses the specific requirements of large-scale text-to-image synthesis, such as large capacity, stable training on diverse datasets, strong text alignment, and controllable variation vs. text alignment tradeoff. StyleGAN-T significantly improves over previous GANs and outperforms distilled diffusion models, the previous state-of-the-art in fast text-to-image synthesis, in terms of sample quality and speed. We use zero-shot MS COCO to compare the performance of our model to the state-of-the-art quantitatively at 64 × 64 pixel output resolution in Table 2 and 256 × 256 in Table 3. (A minimal sketch of zero-shot FID evaluation follows this table.) |
| Researcher Affiliation | Collaboration | University of Tübingen, Tübingen AI Center; NVIDIA. |
| Pseudocode | No | The paper includes architectural diagrams (Figure 3) but does not provide any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our implementation is available at https://github.com/autonomousvision/stylegan-t. |
| Open Datasets | Yes | We train on a union of several datasets: CC12m (Changpinyo et al., 2021), CC (Sharma et al., 2018), YFCC100m (filtered) (Thomee et al., 2016; Singh et al., 2022), Redcaps (Desai et al., 2021), LAION-aesthetic-6+ (Schuhmann et al., 2022). This amounts to a total of 250M text-image pairs. |
| Dataset Splits | No | The paper states that it trains on a union of several datasets and evaluates using zero-shot MS COCO. However, it does not provide specific details on how its combined training data was split into training, validation, and test sets (e.g., percentages, sample counts, or specific file names for custom splits) for its own model development and hyperparameter tuning. While MS COCO is used for zero-shot evaluation, the internal training/validation split methodology is not detailed. |
| Hardware Specification | Yes | We have a fixed training budget of 4 weeks on 64 NVIDIA A100s available for training our final model at scale. StyleGAN-T greatly narrows the quality gap between GANs and other model families while generating samples at a rate of 10 FPS on an NVIDIA A100. Generating these 56 samples at 512 × 512 takes 6 seconds on an NVIDIA RTX 3090. (A throughput-measurement sketch follows this table.) |
| Software Dependencies | No | The paper mentions software components like "Adam Optimizer" and "CLIP guidance" but does not specify version numbers for any programming languages, libraries (e.g., PyTorch, TensorFlow), or other software tools used in the experiments. |
| Experiment Setup | Yes | Table 4 lists the training and network architecture hyperparameters for our two configurations: lightweight (used for ablations) and the full configuration (used for main results). This includes details such as Generator channel base/max, Number of residual blocks, Generator/Text encoder parameters, Latent (z) dimension, Discriminator features, Dataset size, Number of GPUs, Batch size, Optimizer, Learning rates, Adam betas, EMA, CLIP guidance weight, and Progressive growing. Table 5 details the training schedules including phases and A100 days. (An illustrative configuration sketch follows this table.) |
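
The zero-shot MS COCO comparison cited in the Research Type row refers to computing FID between caption-conditioned samples and real COCO images for a model that never trained on COCO. Below is a minimal sketch of that kind of evaluation using torchmetrics' `FrechetInceptionDistance`; it is not the paper's evaluation code, and the small sample counts and 64-feature setting are demo-only choices (standard FID, as reported in the paper's tables, uses 2048-dimensional Inception features over tens of thousands of samples).

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance


def zero_shot_fid(real_images: torch.Tensor, fake_images: torch.Tensor,
                  feature: int = 2048) -> float:
    """FID between real and generated images.

    Both inputs are assumed to be uint8 tensors of shape [N, 3, H, W];
    in a zero-shot COCO protocol the fakes are generated from COCO
    validation captions by a model not trained on COCO.
    """
    fid = FrechetInceptionDistance(feature=feature)
    # update() expects uint8 images in [0, 255] when normalize=False (default).
    fid.update(real_images, real=True)
    fid.update(fake_images, real=False)
    return fid.compute().item()


if __name__ == "__main__":
    # Random stand-ins just to exercise the metric; the result is meaningless.
    real = torch.randint(0, 256, (256, 3, 64, 64), dtype=torch.uint8)
    fake = torch.randint(0, 256, (256, 3, 64, 64), dtype=torch.uint8)
    print(f"FID (random data): {zero_shot_fid(real, fake, feature=64):.2f}")
```

In an actual reproduction, `fake_images` would be produced by the released StyleGAN-T checkpoint conditioned on MS COCO captions and resized to the target resolution before the metric update.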
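
The 10 FPS figure in the Hardware Specification row is a sampling-throughput number. The sketch below shows one plausible way to measure images per second on a GPU; the generator here is a placeholder module standing in for the released StyleGAN-T checkpoint, it omits text conditioning, and the batch size, latent dimension, and resolution are illustrative assumptions.

```python
import time
import torch


@torch.no_grad()
def measure_fps(generator: torch.nn.Module, z_dim: int, batch_size: int = 8,
                iters: int = 50, device: str = "cuda") -> float:
    """Measure images/second for a latent-to-image generator on one GPU."""
    generator = generator.to(device).eval()
    z = torch.randn(batch_size, z_dim, device=device)

    # Warm-up so kernel compilation and memory allocation don't skew timing.
    for _ in range(5):
        generator(z)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        generator(z)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return batch_size * iters / elapsed


if __name__ == "__main__":
    # Placeholder generator: a linear map reshaped into a 512x512 RGB image.
    # A real measurement would load the released StyleGAN-T weights instead.
    z_dim, res = 64, 512
    dummy = torch.nn.Sequential(
        torch.nn.Linear(z_dim, 3 * res * res),
        torch.nn.Unflatten(1, (3, res, res)),
    )
    print(f"{measure_fps(dummy, z_dim):.1f} images/s (dummy model)")
```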
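
The hyperparameters listed in the Experiment Setup row map naturally onto a flat configuration object. The dataclass below mirrors the categories named in Table 4; every default value is an illustrative placeholder, not a setting reported in the paper, which should be taken from Table 4 or the released code.

```python
from dataclasses import dataclass


@dataclass
class StyleGANTConfig:
    """Illustrative config mirroring Table 4's hyperparameter categories.

    All defaults are placeholders, not the values reported in the paper.
    """
    # Generator / discriminator architecture
    channel_base: int = 32768          # generator channel base
    channel_max: int = 512             # generator channel max
    num_res_blocks: int = 2            # residual blocks per resolution
    z_dim: int = 64                    # latent (z) dimension
    # Training setup
    num_gpus: int = 8
    batch_size: int = 256
    optimizer: str = "adam"
    g_lr: float = 2e-3                 # generator learning rate
    d_lr: float = 2e-3                 # discriminator learning rate
    adam_betas: tuple = (0.0, 0.99)
    ema_kimg: float = 10.0             # EMA half-life, in thousands of images
    clip_guidance_weight: float = 0.2
    progressive_growing: bool = True


if __name__ == "__main__":
    # Two named configurations, echoing the paper's lightweight vs. full split.
    lightweight = StyleGANTConfig()
    full = StyleGANTConfig(num_gpus=64, batch_size=2048)
    print(lightweight)
    print(full)
```

Separating the lightweight (ablation) and full (main-result) configurations into two instances of one dataclass keeps the comparison between them explicit, which is how Table 4 presents them.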