Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Authors: Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Robin Rombach
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. [...] We train models on ImageNet (Russakovsky et al., 2014) and CC12M (Changpinyo et al., 2021), and evaluate both the training and the EMA weights of the models during training using validation losses, CLIP scores (Radford et al., 2021; Hessel et al., 2021), and FID (Heusel et al., 2017) under different sampler settings. |
| Researcher Affiliation | Industry | Patrick Esser * Sumith Kulal Andreas Blattmann Rahim Entezari Jonas Müller Harry Saini Yam Levi Dominik Lorenz Axel Sauer Frederic Boesel Dustin Podell Tim Dockhorn Zion English Robin Rombach * Stability AI |
| Pseudocode | Yes | Algorithm 1 Finding Duplicate Items in a Cluster [...] Algorithm 2 Detecting Memorization in Generated Images |
| Open Source Code | Yes | The core contributions of our work are: [...] We make results, code, and model weights publicly available. |
| Open Datasets | Yes | We train models on ImageNet (Russakovsky et al., 2014) and CC12M (Changpinyo et al., 2021)... |
| Dataset Splits | Yes | We train models on ImageNet (Russakovsky et al., 2014) and CC12M (Changpinyo et al., 2021), and evaluate both the training and the EMA weights of the models during training using validation losses, CLIP scores (Radford et al., 2021; Hessel et al., 2021), and FID (Heusel et al., 2017) under different sampler settings. [...] All metrics are evaluated on the COCO-2014 validation split (Lin et al., 2014). |
| Hardware Specification | No | The paper mentions "GPU" and "bf16-mixed precision" (which implies certain hardware capabilities) but does not specify exact GPU models (e.g., NVIDIA A100) or CPU details used for experiments. |
| Software Dependencies | No | The paper mentions the "AdamW optimizer (Loshchilov & Hutter, 2017)" and "autofaiss (2023)" but does not provide specific version numbers for these software components or other libraries used for the experiments. |
| Experiment Setup | Yes | In this experiment, we train all models using a global batch size of 1024 using the AdamW optimizer (Loshchilov & Hutter, 2017) with a learning rate of 10^-4 and 1000 linear warmup steps. We use mixed-precision training and keep a copy of the model weights which gets updated every 100 training batches with an exponential moving average (EMA) using a decay factor of 0.99. |
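
The training recipe quoted in the Experiment Setup row maps onto standard PyTorch components. Below is a minimal, hedged sketch of that setup: AdamW at a learning rate of 10^-4, 1000 linear warmup steps, bf16 mixed precision, and an EMA copy of the weights refreshed every 100 batches with decay 0.99. The model, batch, and loss here are placeholders, not the paper's rectified-flow objective or MM-DiT architecture, and the toy batch does not reflect the reported global batch size of 1024.

```python
# Hedged sketch of the reported optimization setup; model/data/loss are stand-ins.
import copy
import torch

model = torch.nn.Linear(16, 16)          # placeholder for the actual transformer
ema_model = copy.deepcopy(model).eval()  # EMA copy, never optimized directly

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# Linear warmup over the first 1000 steps, constant learning rate afterwards.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / 1000)
)

@torch.no_grad()
def ema_update(ema, online, decay=0.99):
    # ema <- decay * ema + (1 - decay) * online
    for p_ema, p in zip(ema.parameters(), online.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1 - decay)

device = "cuda" if torch.cuda.is_available() else "cpu"
for step in range(1000):
    x = torch.randn(8, 16)                       # placeholder batch (not batch size 1024)
    with torch.autocast(device, dtype=torch.bfloat16):
        loss = (model(x) - x).pow(2).mean()      # placeholder loss, not the RF objective
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad(set_to_none=True)
    if (step + 1) % 100 == 0:                    # EMA refresh interval from the paper
        ema_update(ema_model, model)
```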
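
The Research Type and Dataset Splits rows note that the paper evaluates with CLIP scores and FID on the COCO-2014 validation split. The sketch below shows how such an evaluation could be wired up with torchmetrics; the choice of torchmetrics (which pulls in torch-fidelity and transformers), the CLIP backbone, and the placeholder tensors are assumptions of this sketch, not details taken from the paper, and a real evaluation would use many thousands of images rather than a toy batch.

```python
# Hedged sketch of an FID + CLIP-score evaluation; all inputs are placeholders.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

fid = FrechetInceptionDistance(feature=2048)  # expects uint8 images by default
clip_score = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")

# Placeholders standing in for COCO-2014 validation images, generated samples,
# and the corresponding captions.
real_images = torch.randint(0, 256, (4, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (4, 3, 299, 299), dtype=torch.uint8)
captions = ["a photo of a cat sitting on a couch"] * 4

fid.update(real_images, real=True)
fid.update(fake_images, real=False)
clip_score.update(fake_images, captions)

print("FID:", fid.compute().item())
print("CLIP score:", clip_score.compute().item())
```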