SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Authors: Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, Robin Rombach

ICLR 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | User studies demonstrate that SDXL consistently surpasses all previous versions of Stable Diffusion by a significant margin (see Fig. 1). Table 2: Conditioning on the original spatial size of the training examples improves performance on class-conditional ImageNet (Deng et al., 2009) at 512² resolution. |
| Researcher Affiliation | Academia | No explicit institutional affiliations (university names, company names, or email domains) are provided within the paper's text. |
| Pseudocode | Yes | Algorithm 1: Size- and crop-micro-conditioning |
| Open Source Code | No | The paper states 'With SDXL we are releasing an open model' but does not provide a direct link to a source code repository for the described methodology or an explicit statement about its availability (e.g., 'Our code is available at...'). |
| Open Datasets | Yes | We quantitatively assess the effects of this simple but effective conditioning technique by training and evaluating three LDMs on class-conditional ImageNet (Deng et al., 2009) at spatial size 512². |
| Dataset Splits | No | The paper mentions evaluating against 'the full validation set' for ImageNet metrics, but it does not provide specific details on the train/validation/test splits used for their models, particularly for the internal dataset or for the ImageNet models beyond the training set size. |
| Hardware Specification | No | The paper describes training procedures (e.g., a batch size of 2048) but does not specify any hardware details such as GPU models, CPU types, or other computing resources used for experiments. |
| Software Dependencies | No | The paper mentions 'PyTorch (Paszke et al., 2019)' but does not provide specific version numbers for any software dependencies or libraries used in the experiments. |
| Experiment Setup | Yes | First, we pretrain a base model (see Tab. 1) on an internal dataset whose height- and width-distribution is visualized in Fig. 2 for 600 000 optimization steps at a resolution of 256 × 256 pixels and a batch size of 2048, using size- and crop-conditioning as described in Sec. 2.2. We continue training on 512 px for another 200 000 optimization steps, and finally utilize multi-aspect training (Sec. 2.3) in combination with an offset-noise (Guttenberg & CrossLabs, 2023; Lin et al., 2023) level of 0.05 to train the model on different aspect ratios (Sec. 2.3, App. H) of 1024 × 1024 pixel area. |
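
The "Pseudocode" row above cites Algorithm 1 (size- and crop-micro-conditioning). As a reading aid, here is a minimal sketch of that idea: the original image size, the crop coordinates, and the target size are each Fourier-embedded and concatenated before being merged with the diffusion timestep embedding. The function names, embedding width, and the inclusion of the target size are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch of SDXL-style size- and crop-micro-conditioning (Sec. 2.2 / Algorithm 1).
# Embedding dimensions, module names, and the exact way the vector is merged with the
# timestep embedding are assumptions for illustration only.
import math
import torch


def sinusoidal_embedding(x: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Standard transformer-style sinusoidal embedding of one scalar per batch element."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = x.float()[:, None] * freqs[None, :]          # (B, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)  # (B, dim)


def micro_conditioning(orig_size, crop_top_left, target_size, dim: int = 256) -> torch.Tensor:
    """Embed (h_orig, w_orig), (c_top, c_left), (h_tgt, w_tgt) and concatenate.

    Each of the six integers is Fourier-embedded independently; the concatenated
    vector would then be projected and added to the timestep embedding (projection
    omitted here).
    """
    scalars = torch.cat([orig_size, crop_top_left, target_size], dim=1)  # (B, 6)
    embs = [sinusoidal_embedding(scalars[:, i], dim) for i in range(scalars.shape[1])]
    return torch.cat(embs, dim=-1)  # (B, 6 * dim)


# Example: one 512x768 source image, cropped starting at (0, 64), trained at 1024x1024.
cond = micro_conditioning(
    orig_size=torch.tensor([[512, 768]]),
    crop_top_left=torch.tensor([[0, 64]]),
    target_size=torch.tensor([[1024, 1024]]),
)
print(cond.shape)  # torch.Size([1, 1536])
```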
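
The "Experiment Setup" row quotes an offset-noise level of 0.05. The sketch below shows how such a per-channel noise offset is commonly added to the Gaussian training noise; the surrounding variable names and latent shape are illustrative assumptions, not taken from the paper.

```python
# Hedged sketch of the offset-noise trick cited in the Experiment Setup row
# (offset-noise level 0.05). Only the per-channel offset follows the cited technique;
# everything else is an illustrative assumption.
import torch


def sample_offset_noise(latents: torch.Tensor, offset: float = 0.05) -> torch.Tensor:
    """Gaussian noise plus a per-sample, per-channel constant offset.

    The extra term shifts the mean of the noise for each channel, which lets the
    model adjust the overall brightness of generated images more freely.
    """
    noise = torch.randn_like(latents)
    noise += offset * torch.randn(
        latents.shape[0], latents.shape[1], 1, 1, device=latents.device
    )
    return noise


# Example: a batch of 4 latents with 4 channels at 128x128 (1024 px after 8x downsampling).
latents = torch.randn(4, 4, 128, 128)
noise = sample_offset_noise(latents, offset=0.05)
```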