MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation

Authors: Omer Bar-Tal, Lior Yariv, Yaron Lipman, Tali Dekel

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We thoroughly evaluate our method when applied to each task as discussed in Sec. 4. In all experiments, we used Stable Diffusion (Rombach et al., 2022), where the diffusion process is defined over a latent space $\mathcal{I} = \mathbb{R}^{64 \times 64 \times 4}$, and a decoder is trained to reconstruct natural images in higher resolution $[0, 1]^{512 \times 512 \times 3}$."
Researcher Affiliation | Collaboration | "¹Weizmann Institute of Science, ²Meta AI."
Pseudocode | Yes | "Algorithm 1: MultiDiffusion sampling." (A hedged sketch of one MultiDiffusion denoising step appears after the table.)
Open Source Code | No | "Project page is available at https://multidiffusion.github.io." This is a project-page link, not an explicit statement of code release or a direct link to a code repository.
Open Datasets | Yes | "To quantitatively evaluate our performance, we use the COCO dataset (Lin et al., 2014), which contains images with a global text caption and instance masks for each object in the image."
Dataset Splits | No | "We apply our method on a subset from the validation set, obtained by filtering examples which consist of 2 to 4 foreground objects, excluding people, and masks that occupy less than 5% of the image." The paper evaluates on this filtered COCO validation subset, but since the method requires no training, it specifies no train/validation split of its own. (A sketch of one reading of the filtering rule appears after the table.)
Hardware Specification | No | The paper does not provide specific details on the hardware used to run the experiments, such as CPU or GPU models.
Software Dependencies | No | "In all experiments, we used Stable Diffusion (Rombach et al., 2022)." The paper names Stable Diffusion but gives no version numbers for software dependencies or libraries.
Experiment Setup | Yes | "We set $T_{\mathrm{init}}$ to be 20% of the generation process (i.e., $T_{\mathrm{init}} = 800$). In all experiments, we used Stable Diffusion (Rombach et al., 2022), where the diffusion process is defined over a latent space $\mathcal{I} = \mathbb{R}^{64 \times 64 \times 4}$, and a decoder is trained to reconstruct natural images in higher resolution $[0, 1]^{512 \times 512 \times 3}$. Similarly, the MultiDiffusion process $\Psi$ is defined in the latent space $\mathcal{J} = \mathbb{R}^{H \times W \times 4}$, and using the decoder we produce the results in the target image space $[0, 1]^{8H \times 8W \times 3}$." (A latent/pixel bookkeeping sketch for this setup appears below.)
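
To make the Algorithm 1 row concrete, here is a minimal sketch of one MultiDiffusion denoising step in PyTorch. The callable `denoise_step` and the `windows` layout are hypothetical stand-ins for a pretrained Stable Diffusion step and the paper's overlapping-crop scheme; the per-pixel averaging is the closed-form solution of the paper's least-squares fusion objective.

```python
import torch

def multidiffusion_step(latent, t, denoise_step, windows):
    """One fused denoising step over overlapping windows (hedged sketch).

    latent:       full canvas latent, shape (1, 4, H, W)
    t:            current diffusion timestep
    denoise_step: placeholder callable mapping a (1, 4, h, w) crop at step t
                  to its denoised crop (e.g., one Stable Diffusion step)
    windows:      list of (top, left, height, width) crop coordinates
    """
    value = torch.zeros_like(latent)   # accumulates per-pixel predictions
    count = torch.zeros_like(latent)   # how many windows cover each pixel

    for (top, left, h, w) in windows:
        crop = latent[:, :, top:top + h, left:left + w]
        denoised = denoise_step(crop, t)
        value[:, :, top:top + h, left:left + w] += denoised
        count[:, :, top:top + h, left:left + w] += 1

    # Fusing the per-window diffusion paths reduces to averaging all
    # window predictions covering each pixel; clamp avoids division by
    # zero on any pixel no window happens to cover.
    return value / count.clamp(min=1)
```

Because every window is denoised by the same pretrained model and then reconciled by averaging at each step, the crops follow a single coherent path, which is what removes seams between adjacent windows.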
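For the dataset-splits row, the following hedged sketch shows one plausible reading of the quoted filtering rule using pycocotools: keep COCO validation images that contain 2 to 4 non-person objects whose instance masks each cover at least 5% of the image. The function name and the per-object threshold interpretation are assumptions, not the authors' released code.

```python
from pycocotools.coco import COCO

def filter_coco_examples(ann_file, min_area_frac=0.05):
    """Return COCO image ids with 2-4 non-person foreground objects whose
    masks each cover at least `min_area_frac` of the image (assumed reading
    of the paper's filtering rule)."""
    coco = COCO(ann_file)
    person_id = coco.getCatIds(catNms=["person"])[0]
    keep = []
    for img_id in coco.getImgIds():
        info = coco.loadImgs(img_id)[0]
        img_area = info["height"] * info["width"]
        anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id, iscrowd=False))
        objs = [a for a in anns
                if a["category_id"] != person_id
                and a["area"] / img_area >= min_area_frac]
        if 2 <= len(objs) <= 4:
            keep.append(img_id)
    return keep
```

The quote leaves ambiguous whether the 5% threshold applies per object or per image; the sketch applies it per object.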
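Last, a small bookkeeping sketch of the quoted experiment setup. Only the 8x decoder scaling and the 20% bootstrapping fraction (hence T_init = 800 with T = 1000 timesteps counting down) come from the quote; the canvas size, window size, and stride are illustrative assumptions.

```python
# Latent/pixel bookkeeping for the quoted setup (assumed canvas and crops).
H, W = 64, 256               # latent canvas J = R^{H x W x 4}, e.g. a panorama
window, stride = 64, 16      # illustrative overlapping 64x64 crops

# The decoder upsamples 8x per spatial dim: (H, W) latents decode to
# (8H, 8W) pixels, matching [0,1]^{8H x 8W x 3} in the quote.
pixel_h, pixel_w = 8 * H, 8 * W          # 512 x 2048 for this canvas

# Overlapping crop origins covering the full canvas.
windows = [(top, left, window, window)
           for top in range(0, H - window + 1, stride)
           for left in range(0, W - window + 1, stride)]

# Bootstrapping covers the first 20% of denoising: with timesteps counting
# down from T = 1000, that is t >= T_init = 800, matching the quote.
T = 1000
T_init = int(0.8 * T)        # 800
```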