SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
Authors: Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, Robin Rombach
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | User studies demonstrate that SDXL consistently surpasses all previous versions of Stable Diffusion by a significant margin (see Fig. 1). Table 2: Conditioning on the original spatial size of the training examples improves performance on class-conditional ImageNet (Deng et al., 2009) at 512² resolution. |
| Researcher Affiliation | Academia | No explicit institutional affiliations (university names, company names, or email domains) are provided within the paper's text. |
| Pseudocode | Yes | Algorithm 1: Size- and crop-micro-conditioning (a minimal sketch of this conditioning appears below the table). |
| Open Source Code | No | The paper states 'With SDXL we are releasing an open model' but does not provide a direct link to a source code repository for the described methodology or an explicit statement about its availability (e.g., 'Our code is available at...'). |
| Open Datasets | Yes | We quantitatively assess the effects of this simple but effective conditioning technique by training and evaluating three LDMs on class-conditional ImageNet (Deng et al., 2009) at spatial size 512². |
| Dataset Splits | No | The paper mentions evaluating against 'the full validation set' for ImageNet metrics, but it does not provide specific details on the train/validation/test splits used for their models, particularly for the internal dataset or for the ImageNet models beyond the training set size. |
| Hardware Specification | No | The paper describes training procedures (e.g., 'batchsize of 2048') but does not specify any hardware details such as GPU models, CPU types, or other computing resources used for experiments. |
| Software Dependencies | No | The paper mentions 'PyTorch (Paszke et al., 2019)' but does not provide specific version numbers for any software dependencies or libraries used in the experiments. |
| Experiment Setup | Yes | First, we pretrain a base model (see Tab. 1) on an internal dataset whose height- and width-distribution is visualized in Fig. 2 for 600 000 optimization steps at a resolution of 256 × 256 pixels and a batch size of 2048, using size- and crop-conditioning as described in Sec. 2.2. We continue training on 512 px for another 200 000 optimization steps, and finally utilize multi-aspect training (Sec. 2.3) in combination with an offset-noise (Guttenberg & Cross Labs, 2023; Lin et al., 2023) level of 0.05 to train the model on different aspect ratios (Sec. 2.3, App. H) of 1024 × 1024 pixel area. (This schedule is restated as a config sketch below the table.) |
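
The Pseudocode row refers to the paper's size- and crop-micro-conditioning (Algorithm 1). The snippet below is a minimal, illustrative sketch of how such micro-conditioning is commonly implemented: each integer (original height/width and crop coordinates) is given a sinusoidal Fourier embedding, and the concatenated embedding is projected and added to the timestep embedding that conditions the UNet. The module name `MicroConditioner`, the dimensions, and the projection layers are assumptions for illustration, not the authors' code.

```python
# Minimal sketch (not the authors' implementation) of SDXL-style
# size- and crop-micro-conditioning.
import math
import torch
import torch.nn as nn


def sinusoidal_embedding(x: torch.Tensor, dim: int) -> torch.Tensor:
    """Transformer-style sinusoidal embedding of one scalar per batch item."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=x.device) / half)
    args = x.float().unsqueeze(-1) * freqs            # (B, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)  # (B, dim)


class MicroConditioner(nn.Module):
    """Maps (h_orig, w_orig, crop_top, crop_left) to a vector added to the time embedding."""

    def __init__(self, fourier_dim: int = 256, time_embed_dim: int = 1280):
        super().__init__()
        self.fourier_dim = fourier_dim
        # Four scalars, each Fourier-embedded, then projected to the time-embedding width.
        self.proj = nn.Sequential(
            nn.Linear(4 * fourier_dim, time_embed_dim),
            nn.SiLU(),
            nn.Linear(time_embed_dim, time_embed_dim),
        )

    def forward(self, size_and_crop: torch.Tensor) -> torch.Tensor:
        # size_and_crop: (B, 4) integer tensor [h_orig, w_orig, crop_top, crop_left]
        embs = [sinusoidal_embedding(size_and_crop[:, i], self.fourier_dim)
                for i in range(size_and_crop.shape[1])]
        return self.proj(torch.cat(embs, dim=-1))  # (B, time_embed_dim); add to t-embedding


# Usage (illustrative): cond = MicroConditioner()(torch.tensor([[768, 512, 0, 64]]))
# t_emb = t_emb + cond
```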
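Similarly, the training schedule quoted in the Experiment Setup row can be restated as a plain configuration sketch. Only the numbers that appear in the quote are used; the key names and the list structure are hypothetical.

```python
# Hypothetical restatement of the quoted SDXL base-model training schedule.
SDXL_BASE_TRAINING_STAGES = [
    # 1) Pretraining with size- and crop-conditioning on an internal dataset.
    {"resolution": 256, "optimization_steps": 600_000, "batch_size": 2048},
    # 2) Continued training at higher resolution.
    {"resolution": 512, "optimization_steps": 200_000},
    # 3) Multi-aspect training on ~1024x1024 pixel area with offset noise.
    {"pixel_area": 1024 * 1024, "offset_noise_level": 0.05, "multi_aspect": True},
]
```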