Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis

Authors: Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Reddy Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, William Yang Wang

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experimental results show that our method achieves more accurate attribute binding and compositionality in the generated images. We also propose a benchmark named Attribute Binding Contrast set (ABC-6K) to measure the compositional skills of T2I models. We conduct extensive experiments and analysis to identify the causes of incorrect attribute binding, which points out future directions in improving the faithfulness and compositionality of text-to-image synthesis."
Researcher Affiliation | Collaboration | 1 University of California, Santa Barbara; 2 University of California, Santa Cruz; 3 Google
Pseudocode | Yes | "Algorithm 1 Structure Diffusion Guidance."
Open Source Code | Yes | "We release our core codebase containing the methodology implementation, settings, benchmarks containing compositional prompts under supplementary materials."
Open Datasets | No | The paper uses the pre-trained Stable Diffusion model and evaluates on specific benchmarks (ABC-6K, CC-500) and 10K randomly sampled captions from MSCOCO. These are evaluation datasets only; because the method is training-free guidance for an existing model, the authors train no model and therefore identify no training data.
Dataset Splits | No | For the same reason, no training or validation splits are specified: the method applies training-free guidance to a pre-trained model, and only the evaluation sets above (ABC-6K, CC-500, 10K MSCOCO captions) are used.
Hardware Specification | No | The paper does not specify the hardware used for the experiments (e.g., GPU/CPU models or memory).
Software Dependencies | Yes | "Throughout the experiments, we implement our method upon Stable Diffusion v1.4."
Experiment Setup | Yes | "Throughout the experiments, we implement our method upon Stable Diffusion v1.4. ... We fix the guidance scale to 7.5 and equally weight the key-value matrices in cross-attention layers if not otherwise specified. We do not add hand-crafted prompts such as 'a photo of' to the text input. We use the Stanza Library (Qi et al., 2020) for constituency parsing and obtain noun phrases if not otherwise specified."
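The experiment-setup row notes that the method obtains noun phrases from the prompt via constituency parsing with Stanza. As a minimal, dependency-free sketch of that extraction step, the snippet below assumes the parse is already available as a Penn-Treebank-style bracketed string (rather than calling Stanza itself, which requires downloading its models); the `parse_tree`, `leaves`, and `noun_phrases` helpers are illustrative names, not from the paper's codebase.

```python
def parse_tree(s):
    """Parse a bracketed constituency string into (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()

    def helper(i):
        # tokens[i] must open a constituent: "(" LABEL children ")"
        assert tokens[i] == "("
        label = tokens[i + 1]
        i += 2
        children = []
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = helper(i)
                children.append(child)
            else:                     # bare word (leaf)
                children.append(tokens[i])
                i += 1
        return (label, children), i + 1

    tree, _ = helper(0)
    return tree

def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [w for c in children for w in leaves(c)]

def noun_phrases(node):
    """Collect the word span of every NP subtree (outer NPs first)."""
    if isinstance(node, str):
        return []
    label, children = node
    result = [" ".join(leaves(node))] if label == "NP" else []
    for c in children:
        result += noun_phrases(c)
    return result

tree = parse_tree(
    "(ROOT (NP (NP (DT a) (JJ red) (NN car)) (CC and) "
    "(NP (DT a) (JJ white) (NN sheep))))"
)
print(noun_phrases(tree))
# ['a red car and a white sheep', 'a red car', 'a white sheep']
```

In the actual setting, each extracted noun phrase would be encoded separately by the text encoder so that attributes stay bound to the correct nouns in the cross-attention layers.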