Würstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models

Authors: Pablo Pernias, Dominic Rampas, Mats Leon Richter, Christopher Pal, Marc Aubreville

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our approach also improves the quality of text-conditioned image generation based on our user preference study. The training requirements of our approach consist of 24,602 A100-GPU hours, compared to Stable Diffusion 2.1's 200,000 GPU hours. Our approach also requires less training data to achieve these results. Furthermore, our compact latent representations allow us to perform inference over twice as fast, slashing the usual costs and carbon footprint of a state-of-the-art (SOTA) diffusion model significantly, without compromising the end performance. In a broader comparison against SOTA models, our approach is substantially more efficient and compares favourably in terms of image quality. We believe that this work motivates more emphasis on the prioritization of both performance and computational accessibility.
Researcher Affiliation | Collaboration | Pablo Pernías (LAION e.V.); Dominic Rampas (Technische Hochschule Ingolstadt, Wand Technologies Inc., LAION e.V.); Mats L. Richter (Mila, Quebec AI Institute); Christopher J. Pal (Polytechnique Montréal, Canada CIFAR AI Chair); Marc Aubreville (Technische Hochschule Ingolstadt)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | We are publicly releasing the source code and the entire suite of model weights. ... We will provide all of our source code, including training and inference scripts and trained models, on GitHub. ... We release the entire source code of our pipeline, together with the model weights used to generate these results, in our GitHub repository. We also include instructions on how to train the model and an inference notebook.
Open Datasets | Yes | All stages were trained on aggressively filtered (approx. 103M images) subsets of the improved-aesthetic LAION-5B (Schuhmann et al., 2022) dataset. ... This work uses the LAION-5B dataset, which is sourced from the freely available Common Crawl web index and was recently criticized as containing problematic content. ... As described in Appendix F, we only used deduplicated publicly available data to train the model. (A hedged data-filtering sketch follows the table.)
Dataset Splits | No | The paper mentions generating images from prompts randomly chosen from the COCO validation set for evaluation, but it does not provide specific training/validation/test splits (percentages, counts, or explicit instructions) for its primary training data (LAION-5B), nor describe how COCO was partitioned beyond using its validation prompts. (A hedged prompt-sampling sketch follows the table.)
Hardware Specification | Yes | The training requirements of our approach consist of 24,602 A100-GPU hours, compared to Stable Diffusion 2.1's 200,000 GPU hours. ... Inference time for 1024×1024 images on an A100 GPU for Würstchen and three competitive approaches
Software Dependencies | No | The paper describes the use of certain models and frameworks (e.g., VQGAN, DDPM, CLIP-H, ConvNeXt blocks, EfficientNetV2-Small) but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | All the experiments use the standard DDPM (Ho et al., 2020) algorithm to sample latents in Stages B and C. Both stages also make use of classifier-free guidance (Ho & Salimans, 2021) with guidance scale w. We fix the hyperparameters for Stage B sampling to τ_B = 12 and w = 4; Stage C uses τ_C = 60 for sampling. Images are generated at 1024×1024 resolution. ... we trained an 18M-parameter Stage A, a 1B-parameter Stage B and a 1B-parameter Stage C. We employed an EfficientNetV2-Small as Semantic Compressor (Tan & Le, 2019) during training. Stages B and C are conditioned on un-pooled CLIP-H (Ilharco et al., 2021) text embeddings. The setup is designed to produce images of variable aspect ratio with up to 1538 pixels per side. (A hedged sampling sketch follows the table.)
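
Referenced from the Open Datasets row: the paper trains on aggressively filtered subsets of the improved-aesthetic LAION-5B. A minimal sketch of what metadata-level filtering of one shard could look like is below; the parquet file name, the column names (URL, TEXT, AESTHETIC_SCORE), and the score threshold are illustrative assumptions, not the authors' actual filtering criteria.

```python
# Hypothetical sketch of metadata-level filtering on a LAION-style parquet shard.
# Column names and the aesthetic threshold are assumptions for illustration only.
import pandas as pd

def filter_shard(parquet_path: str, min_aesthetic: float = 5.0) -> pd.DataFrame:
    df = pd.read_parquet(parquet_path)
    keep = (
        df["AESTHETIC_SCORE"].ge(min_aesthetic)  # keep higher-aesthetic samples
        & df["TEXT"].str.len().gt(5)             # drop near-empty captions
    )
    return df.loc[keep, ["URL", "TEXT"]]

if __name__ == "__main__":
    subset = filter_shard("laion_shard_0000.parquet")  # hypothetical shard file
    subset.to_parquet("filtered_shard_0000.parquet")
    print(f"kept {len(subset)} rows from the shard")
```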
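
Referenced from the Dataset Splits row: evaluation prompts are described only as randomly chosen from COCO-validation. A minimal sketch of drawing such prompts from the standard COCO 2017 validation caption annotations is shown below; the annotation file name, sample size, and seed are assumptions.

```python
# Minimal sketch: sample evaluation prompts from COCO validation captions.
# The annotation file, the sample size of 1,000, and the seed are assumptions;
# the paper states only that prompts were randomly chosen from COCO-validation.
import json
import random

def sample_coco_prompts(annotation_file: str, n: int = 1000, seed: int = 0) -> list:
    with open(annotation_file) as f:
        captions = [a["caption"] for a in json.load(f)["annotations"]]
    rng = random.Random(seed)
    return rng.sample(captions, k=min(n, len(captions)))

if __name__ == "__main__":
    prompts = sample_coco_prompts("captions_val2017.json")  # standard COCO 2017 file
    print(prompts[:3])
```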
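
Referenced from the Experiment Setup row: a generic sketch of DDPM ancestral sampling with classifier-free guidance. Only the step counts (τ_B = 12, τ_C = 60) and the Stage B guidance scale (w = 4) come from the quoted setup; the denoiser interface, the linear beta schedule, the latent shape, and the exact CFG convention are illustrative assumptions rather than the released Würstchen implementation.

```python
# Generic sketch of DDPM sampling with classifier-free guidance (CFG).
# Step counts (12 / 60) and w = 4 come from the paper's quoted setup; everything
# else (denoiser signature, beta schedule, latent shape, CFG convention) is an
# illustrative assumption.
import torch

def cfg_ddpm_sample(denoiser, text_emb, null_emb, shape, steps=60, w=4.0, device="cpu"):
    betas = torch.linspace(1e-4, 0.02, steps, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape, device=device)  # start from pure Gaussian noise
    for t in reversed(range(steps)):
        # Predict noise with and without the text condition, then combine (CFG).
        eps_cond = denoiser(x, t, text_emb)
        eps_uncond = denoiser(x, t, null_emb)
        eps = eps_uncond + w * (eps_cond - eps_uncond)

        # DDPM posterior mean (Ho et al., 2020), using sigma_t^2 = beta_t.
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        x = mean + torch.sqrt(betas[t]) * torch.randn_like(x) if t > 0 else mean
    return x

if __name__ == "__main__":
    # Dummy denoiser standing in for a text-conditional stage; the latent shape is made up.
    dummy = lambda x, t, cond: torch.zeros_like(x)
    latents = cfg_ddpm_sample(dummy, text_emb=None, null_emb=None,
                              shape=(1, 16, 24, 24), steps=60, w=4.0)
    print(latents.shape)
```

Stage B would use the same loop with steps=12; the guidance scale for Stage C is not stated in the quoted setup.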