Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Authors: Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Robin Rombach

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. [...] We train models on ImageNet (Russakovsky et al., 2014) and CC12M (Changpinyo et al., 2021), and evaluate both the training and the EMA weights of the models during training using validation losses, CLIP scores (Radford et al., 2021; Hessel et al., 2021), and FID (Heusel et al., 2017) under different sampler settings."
Researcher Affiliation | Industry | Patrick Esser*, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Robin Rombach*; Stability AI
Pseudocode | Yes | "Algorithm 1 Finding Duplicate Items in a Cluster [...] Algorithm 2 Detecting Memorization in Generated Images"
Open Source Code | Yes | "The core contributions of our work are: [...] We make results, code, and model weights publicly available."
Open Datasets | Yes | "We train models on ImageNet (Russakovsky et al., 2014) and CC12M (Changpinyo et al., 2021)..."
Dataset Splits | Yes | "We train models on ImageNet (Russakovsky et al., 2014) and CC12M (Changpinyo et al., 2021), and evaluate both the training and the EMA weights of the models during training using validation losses, CLIP scores (Radford et al., 2021; Hessel et al., 2021), and FID (Heusel et al., 2017) under different sampler settings. [...] All metrics are evaluated on the COCO-2014 validation split (Lin et al., 2014)." (See the evaluation sketch below the table.)
Hardware Specification | No | The paper mentions "GPU" and "bf16-mixed precision" (which implies certain hardware capabilities) but does not specify the exact GPU models (e.g., NVIDIA A100) or CPU details used for the experiments.
Software Dependencies | No | The paper mentions the "AdamW optimizer (Loshchilov & Hutter, 2017)" and "autofaiss (2023)" but does not provide version numbers for these software components or for other libraries used in the experiments.
Experiment Setup | Yes | "In this experiment, we train all models using a global batch size of 1024 using the AdamW optimizer (Loshchilov & Hutter, 2017) with a learning rate of 10^-4 and 1000 linear warmup steps. We use mixed-precision training and keep a copy of the model weights which gets updated every 100 training batches with an exponential moving average (EMA) using a decay factor of 0.99."
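A minimal sketch of the optimization setup quoted in the Experiment Setup row, assuming a PyTorch-style training loop: AdamW with learning rate 1e-4, 1000 linear warmup steps, bf16 mixed precision, and an EMA weight copy updated every 100 batches with decay 0.99. Here `model`, `compute_loss`, and `dataloader` are hypothetical placeholders rather than the authors' code, and the global batch size of 1024 is assumed to be handled by the dataloader (e.g., via data parallelism).

```python
import copy
import torch

def train(model, dataloader, compute_loss, total_steps, device="cuda"):
    # AdamW with the reported learning rate of 1e-4.
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    # Linear warmup from 0 to the base learning rate over the first 1000 steps.
    sched = torch.optim.lr_scheduler.LambdaLR(opt, lambda s: min(1.0, (s + 1) / 1000))
    # Separate EMA copy of the weights, updated every 100 batches with decay 0.99.
    ema_model = copy.deepcopy(model).eval()
    ema_decay, ema_every = 0.99, 100

    model.train()
    for step, batch in enumerate(dataloader, start=1):
        # bf16 mixed precision for the forward pass.
        with torch.autocast(device_type=device, dtype=torch.bfloat16):
            loss = compute_loss(model, batch)   # placeholder training loss
        loss.backward()
        opt.step()
        sched.step()
        opt.zero_grad(set_to_none=True)

        if step % ema_every == 0:
            # EMA update: ema <- 0.99 * ema + 0.01 * current weights.
            with torch.no_grad():
                for p_ema, p in zip(ema_model.parameters(), model.parameters()):
                    p_ema.mul_(ema_decay).add_(p, alpha=1.0 - ema_decay)
        if step >= total_steps:
            break
    return model, ema_model
```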
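The Dataset Splits row reports evaluation with validation losses, CLIP scores, and FID on the COCO-2014 validation split. Below is a hedged sketch of how such metrics could be computed with torchmetrics; it is not the authors' evaluation code, and `sample_images` and `coco_val_loader` are hypothetical stand-ins for the model's sampler and a COCO-2014 validation dataloader.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

@torch.no_grad()
def evaluate(sample_images, coco_val_loader, device="cuda"):
    fid = FrechetInceptionDistance(feature=2048).to(device)
    clip = CLIPScore(model_name_or_path="openai/clip-vit-large-patch14").to(device)

    for real_images, captions in coco_val_loader:         # uint8 images in [0, 255]
        real_images = real_images.to(device)
        fake_images = sample_images(captions).to(device)   # generated samples, uint8

        fid.update(real_images, real=True)                 # reference statistics
        fid.update(fake_images, real=False)                # generated statistics
        clip.update(fake_images, captions)                 # caption-image agreement

    return {"FID": fid.compute().item(), "CLIP score": clip.compute().item()}
```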