Adapting Diffusion Models for Improved Prompt Compliance and Controllable Image Synthesis

Authors: Deepak Sridhar, Abhishek Peri, Rohith Rachala, Nuno Vasconcelos

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Recent advances in generative modeling with diffusion processes (DPs) have enabled breakthroughs in image synthesis. Despite impressive image quality, these models have various prompt compliance problems, including low recall in generating multiple objects, difficulty in generating text in images, and difficulty meeting constraints such as object locations and pose. For fine-grained editing and manipulation, they also require fine-grained semantic or instance maps that are tedious to produce manually. While prompt compliance can be enhanced by adding loss functions at inference, this is time consuming and does not scale to complex scenes. To overcome these limitations, this work introduces a new family of Factor Graph Diffusion Models (FG-DMs) that model the joint distribution of images and conditioning variables, such as semantic, sketch, depth, or normal maps, via a factor graph decomposition. This joint structure has several advantages, including support for efficient sampling-based prompt compliance schemes that produce images of high object recall, semi-automated fine-grained editing, text-based editing of conditions with noise inversion, explainability at intermediate levels, the ability to produce labeled datasets for training downstream models such as segmentation or depth, training with missing data, and continual learning, where new conditioning variables can be added with minimal or no modifications to the existing structure. We propose an implementation of FG-DMs that adapts a pre-trained Stable Diffusion (SD) model to implement all FG-DM factors, using only the COCO dataset, and show that it is effective in generating images with 15% higher recall than SD while retaining its generalization ability. We introduce an attention distillation loss that encourages consistency among the attention maps of all factors, improving the fidelity of the generated conditions and image. We also show that training FG-DMs from scratch on MM-CelebA-HQ, Cityscapes, ADE20K, and COCO produces images of high quality (FID) and diversity (LPIPS). Project Page: FG-DM
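The factor graph decomposition quoted above can be sketched as ancestral sampling over a chain of conditional factors. This is an illustrative sketch only: the function names and dependency bookkeeping are placeholders, not the authors' API, and a real factor would run a (possibly adapter-augmented) diffusion sampler.

```python
# Hypothetical sketch of FG-DM factor-graph sampling. Each factor is a
# conditional model sampled in sequence; here we only record the
# dependency structure instead of running a diffusion sampler.

def sample_factor(name, conditions):
    """Stand-in for one factor p(z_k | parents, prompt)."""
    return {"name": name, "conditions": list(conditions)}

def sample_fg_dm(prompt, factors=("segmentation", "depth")):
    """Ancestral sampling over the factor graph:
    p(x, c_1..c_K | y) = prod_k p(c_k | c_<k, y) * p(x | c_1..c_K, y)."""
    sampled = []
    for factor in factors:
        # Each condition factor sees the prompt and previously sampled maps.
        parents = [prompt] + [s["name"] for s in sampled]
        sampled.append(sample_factor(factor, parents))
    # The image factor conditions on the prompt and all sampled maps.
    image = sample_factor("image", [prompt] + [s["name"] for s in sampled])
    return sampled, image

conds, img = sample_fg_dm("a photo of two cats")
```

The chain structure is what enables the editing workflow described in the abstract: a sampled condition map can be modified and the downstream image factor resampled without rerunning the upstream factors.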
Researcher Affiliation | Academia | Deepak Sridhar, Abhishek Peri, Rohith Rachala, Nuno Vasconcelos. Department of Electrical and Computer Engineering, University of California, San Diego. {desridha, aperi, rrachala, nvasconcelos}@ucsd.edu
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | Yes | Project Page: FG-DM
Open Datasets | Yes | The pretrained SD v1.4 model is adapted using the COCO-WholeBody dataset (23; 18), with 256 input resolution, to train all condition factors... We also present results for an FG-DM trained from scratch on MM-CelebAMask-HQ (22), and for other datasets in the appendix...
Dataset Splits | Yes | We conducted a human evaluation to compare the qualitative performance of the FG-DM (adapted from SD) with N = 1 to the conventional combination of SD+CEM, where CEM is an external condition extraction model, for both segmentation and depth conditions. We collected 51 unique prompts, composed of a random subset of COCO validation prompts and a subset of creative examples. We sampled 51 (image, condition) pairs (35 (image, depth map) pairs and 16 (image, segmentation map) pairs) using the FG-DM. For SD+CEM, images were sampled with SD for the same prompts and fed to a CEM implemented with MiDaS (3) for depth and OpenSeeD (57) for segmentation... Table 3: Object recall statistics for sampling the FG-DM with different seeds and timesteps on the ADE20K validation set prompts.
Hardware Specification | Yes | All speeds are reported using an NVIDIA A10 GPU... We train all models using 2-4 NVIDIA A40 GPUs or 2 NVIDIA A100 GPUs, depending on availability.
Software Dependencies | No | The paper mentions software such as 'PyQt' and components such as 'CLIP', 'DDIM', and the 'torch-fidelity' package, but it does not specify version numbers for these software components or libraries, which would be necessary for a reproducible description of dependencies.
Experiment Setup | Yes | Table 14 summarizes the detailed hyperparameter settings of the FG-DMs trained from scratch reported in the main paper. For FG-DMs adapted from Stable Diffusion, we use the same settings as Stable Diffusion (41) and train only the adapters for 100 epochs with a learning rate of 1e-6 using the AdamW optimizer.
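The adapter-only fine-tuning setup quoted above can be sketched as follows. The optimizer choice (AdamW) and learning rate (1e-6) come from the quoted text; the tiny "backbone" and "adapter" modules are toy placeholders standing in for the frozen SD weights and the trained adapters.

```python
import torch

# Toy stand-ins: a frozen pre-trained backbone and a small trainable adapter.
backbone = torch.nn.Linear(8, 8)
adapter = torch.nn.Linear(8, 8)

# Freeze the pre-trained weights; only the adapter parameters are trained.
for p in backbone.parameters():
    p.requires_grad = False

# AdamW with lr = 1e-6, as reported in the paper's adapted-SD setup.
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-6)
```

This pattern (frozen backbone, small optimizer over adapter parameters) keeps the memory and compute cost of adaptation far below full fine-tuning.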