Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive
Authors: Yumeng Li, Margret Keuper, Dan Zhang, Anna Khoreva
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that ALDM enables layout faithfulness of the generated images, while allowing broad editability via text prompts. Moreover, we showcase its usefulness for practical applications: by synthesizing target distribution samples via text control, we improve domain generalization of semantic segmentation models by a large margin (∼12 mIoU points). |
| Researcher Affiliation | Collaboration | Yumeng Li (1,2), Margret Keuper (2,3), Dan Zhang (1,4), Anna Khoreva (1) — 1: Bosch Center for Artificial Intelligence, 2: University of Mannheim, 3: Max Planck Institute for Informatics, 4: University of Tübingen |
| Pseudocode | No | The paper includes equations and block diagrams (Figure 2), but no formal pseudocode or algorithm blocks. |
| Open Source Code | No | We plan to release the code upon acceptance. |
| Open Datasets | Yes | We conducted experiments on two challenging datasets: ADE20K (Zhou et al., 2017) and Cityscapes (Cordts et al., 2016). Regarding reproducibility, our implementation is based on publicly available models Rombach et al. (2022); Xiao et al. (2018); Zhang & Agrawala (2023); Song et al. (2020) and datasets Zhou et al. (2017); Cordts et al. (2016); Sakaridis et al. (2021) and common corruptions Hendrycks & Dietterich (2018). |
| Dataset Splits | Yes | ADE20K consists of 20K training and 2K validation images, with 150 semantic classes. Cityscapes has 19 classes, whereas there are only 2975 training and 500 validation images. |
| Hardware Specification | Yes | We conducted all training using 2 NVIDIA Tesla A100 GPUs. |
| Software Dependencies | No | The paper mentions "Stable Diffusion v1.5", "ControlNet", "UperNet", "DDIM sampler", "BLIP", "mPLUG", "HRNet", "SegFormer", and "AdamW optimizer", but only provides a specific version for Stable Diffusion v1.5. It does not provide version numbers for other key software components. |
| Experiment Setup | Yes | We finetune the Stable Diffusion v1.5 checkpoint and adopt ControlNet for the layout conditioning. All trainings are conducted at 512×512 resolution. For Cityscapes, we do random cropping, and for ADE20K we directly resize the images. Nevertheless, we directly synthesize 512×1024 Cityscapes images for evaluation. We use the AdamW optimizer with a learning rate of 1×10⁻⁵ for the diffusion model, 1×10⁻⁶ for the discriminator, and a batch size of 8. The adversarial loss weighting factor λadv is set to 0.1. The discriminator is first warmed up for 5K iterations on Cityscapes and 10K iterations on ADE20K. Afterward, we jointly train the diffusion model and discriminator in an adversarial manner. In the unrolling strategy, we use K = 9 as the moving horizon. An ablation study on the choice of K is provided in Table 5. Considering the computing overhead, we apply unrolling every 8 optimization steps. |
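The hyperparameters quoted in the experiment-setup cell can be collected into a small sketch. This is an illustrative reconstruction only, not the authors' code (which was unreleased at submission time); all names here (`TRAIN_CONFIG`, `total_loss`, `should_unroll`) are hypothetical.

```python
# Hypothetical summary of the training setup quoted from the paper.
# Values come from the paper's reported hyperparameters; names are illustrative.

TRAIN_CONFIG = {
    "resolution": (512, 512),      # training resolution (H, W)
    "lr_diffusion": 1e-5,          # AdamW learning rate, diffusion model
    "lr_discriminator": 1e-6,      # AdamW learning rate, discriminator
    "batch_size": 8,
    "lambda_adv": 0.1,             # adversarial loss weight (lambda_adv)
    "warmup_iters": {"cityscapes": 5_000, "ade20k": 10_000},
    "unroll_horizon_K": 9,         # moving horizon K for unrolling
    "unroll_every": 8,             # unroll only every 8th optimization step
}


def total_loss(diffusion_loss: float, adv_loss: float,
               lambda_adv: float = TRAIN_CONFIG["lambda_adv"]) -> float:
    """Combine the diffusion objective with the weighted adversarial term."""
    return diffusion_loss + lambda_adv * adv_loss


def should_unroll(step: int, every: int = TRAIN_CONFIG["unroll_every"]) -> bool:
    """Unrolling is applied only on every `every`-th optimization step."""
    return step % every == 0


print(total_loss(1.0, 2.0))   # 1.0 + 0.1 * 2.0 = 1.2
print(should_unroll(16))      # True
```

The discriminator warm-up (5K/10K iterations) would run before any call to `total_loss` enters the diffusion model's update, matching the two-phase schedule the paper describes.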