Würstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models

Authors: Pablo Pernias, Dominic Rampas, Mats Leon Richter, Christopher Pal, Marc Aubreville

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our approach also improves the quality of text-conditioned image generation based on our user preference study. The training requirements of our approach consist of 24,602 A100-GPU hours, compared to Stable Diffusion 2.1's 200,000 GPU hours. Our approach also requires less training data to achieve these results. Furthermore, our compact latent representations allow us to perform inference over twice as fast, slashing the usual costs and carbon footprint of a state-of-the-art (SOTA) diffusion model significantly, without compromising the end performance. In a broader comparison against SOTA models, our approach is substantially more efficient and compares favourably in terms of image quality. We believe that this work motivates more emphasis on the prioritization of both performance and computational accessibility.
Researcher Affiliation | Collaboration | Pablo Pernías (LAION e.V.); Dominic Rampas (Technische Hochschule Ingolstadt, Wand Technologies Inc., LAION e.V.); Mats L. Richter (Mila, Quebec AI Institute); Christopher J. Pal (Polytechnique Montréal, Canada CIFAR AI Chair); Marc Aubreville (Technische Hochschule Ingolstadt)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | We are publicly releasing the source code and the entire suite of model weights. ... We will provide all of our source code, including training and inference scripts and trained models, on GitHub. ... We release the entire source code of our pipeline, together with the model weights used to generate these results, in our GitHub repository. We also include instructions on how to train the model and an inference notebook.
Open Datasets | Yes | All stages were trained on aggressively filtered (approx. 103M images) subsets of the improved-aesthetic LAION-5B (Schuhmann et al., 2022) dataset. ... This work uses the LAION-5B dataset, which is sourced from the freely available Common Crawl web index and was recently criticized as containing problematic content. ... As described in Appendix F, we only used deduplicated publicly available data to train the model. (A hedged data-filtering sketch follows the table.)
Dataset Splits | No | The paper mentions generating images from prompts randomly chosen from the COCO validation set for evaluation, but it does not provide specific training/validation/test splits (percentages, counts, or explicit instructions) for its primary training data (LAION-5B), nor describe how COCO was partitioned beyond using its validation prompts. (A hedged prompt-sampling sketch follows the table.)
Hardware Specification | Yes | The training requirements of our approach consist of 24,602 A100-GPU hours, compared to Stable Diffusion 2.1's 200,000 GPU hours. ... Inference time for 1024×1024 images on an A100 GPU for Würstchen and three competitive approaches
Software Dependencies | No | The paper describes the use of certain models and frameworks (e.g., VQGAN, DDPM, CLIP-H, ConvNeXt blocks, EfficientNetV2-Small) but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | All the experiments use the standard DDPM (Ho et al., 2020) algorithm to sample latents in Stages B and C. Both stages also make use of classifier-free guidance (Ho & Salimans, 2021) with guidance scale w. We fix the hyperparameters for Stage B sampling to τ_B = 12 and w = 4; Stage C uses τ_C = 60 for sampling. Images are generated at 1024×1024 resolution. ... we trained an 18M-parameter Stage A, a 1B-parameter Stage B and a 1B-parameter Stage C. We employed an EfficientNetV2-Small as Semantic Compressor (Tan & Le, 2019) during training. Stages B and C are conditioned on un-pooled CLIP-H (Ilharco et al., 2021) text embeddings. The setup is designed to produce images of variable aspect ratio with up to 1538 pixels per side. (A hedged sampling sketch follows the table.)
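
Referenced from the Open Datasets row: the paper trains on aggressively filtered subsets of the improved-aesthetic LAION-5B. A minimal sketch of what metadata-level filtering of one shard could look like is below; the parquet file name, the column names (URL, TEXT, AESTHETIC_SCORE), and the score threshold are illustrative assumptions, not the authors' actual filtering criteria.

```python
# Hypothetical sketch of metadata-level filtering on a LAION-style parquet shard.
# Column names and the aesthetic threshold are assumptions for illustration only.
import pandas as pd

def filter_shard(parquet_path: str, min_aesthetic: float = 5.0) -> pd.DataFrame:
    df = pd.read_parquet(parquet_path)
    keep = (
        df["AESTHETIC_SCORE"].ge(min_aesthetic)  # keep higher-aesthetic samples
        & df["TEXT"].str.len().gt(5)             # drop near-empty captions
    )
    return df.loc[keep, ["URL", "TEXT"]]

if __name__ == "__main__":
    subset = filter_shard("laion_shard_0000.parquet")  # hypothetical shard file
    subset.to_parquet("filtered_shard_0000.parquet")
    print(f"kept {len(subset)} rows from the shard")
```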
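
Referenced from the Dataset Splits row: evaluation prompts are described only as randomly chosen from COCO-validation. A minimal sketch of drawing such prompts from the standard COCO 2017 validation caption annotations is shown below; the annotation file name, sample size, and seed are assumptions.

```python
# Minimal sketch: sample evaluation prompts from COCO validation captions.
# The annotation file, the sample size of 1,000, and the seed are assumptions;
# the paper states only that prompts were randomly chosen from COCO-validation.
import json
import random

def sample_coco_prompts(annotation_file: str, n: int = 1000, seed: int = 0) -> list:
    with open(annotation_file) as f:
        captions = [a["caption"] for a in json.load(f)["annotations"]]
    rng = random.Random(seed)
    return rng.sample(captions, k=min(n, len(captions)))

if __name__ == "__main__":
    prompts = sample_coco_prompts("captions_val2017.json")  # standard COCO 2017 file
    print(prompts[:3])
```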
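
Referenced from the Experiment Setup row: a generic sketch of DDPM ancestral sampling with classifier-free guidance. Only the step counts (τ_B = 12, τ_C = 60) and the Stage B guidance scale (w = 4) come from the quoted setup; the denoiser interface, the linear beta schedule, the latent shape, and the exact CFG convention are illustrative assumptions rather than the released Würstchen implementation.

```python
# Generic sketch of DDPM sampling with classifier-free guidance (CFG).
# Step counts (12 / 60) and w = 4 come from the paper's quoted setup; everything
# else (denoiser signature, beta schedule, latent shape, CFG convention) is an
# illustrative assumption.
import torch

def cfg_ddpm_sample(denoiser, text_emb, null_emb, shape, steps=60, w=4.0, device="cpu"):
    betas = torch.linspace(1e-4, 0.02, steps, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape, device=device)  # start from pure Gaussian noise
    for t in reversed(range(steps)):
        # Predict noise with and without the text condition, then combine (CFG).
        eps_cond = denoiser(x, t, text_emb)
        eps_uncond = denoiser(x, t, null_emb)
        eps = eps_uncond + w * (eps_cond - eps_uncond)

        # DDPM posterior mean (Ho et al., 2020), using sigma_t^2 = beta_t.
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        x = mean + torch.sqrt(betas[t]) * torch.randn_like(x) if t > 0 else mean
    return x

if __name__ == "__main__":
    # Dummy denoiser standing in for a text-conditional stage; the latent shape is made up.
    dummy = lambda x, t, cond: torch.zeros_like(x)
    latents = cfg_ddpm_sample(dummy, text_emb=None, null_emb=None,
                              shape=(1, 16, 24, 24), steps=60, w=4.0)
    print(latents.shape)
```

Stage B would use the same loop with steps=12; the guidance scale for Stage C is not stated in the quoted setup.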