Directed Diffusion: Direct Control of Object Placement through Attention Guidance

Authors: Wan-Duo Kurt Ma, Avisek Lahiri, J. P. Lewis, Thomas Leung, W. Bastiaan Kleijn

Venue: AAAI 2024

Each reproducibility variable below is listed with its result and the LLM response that supports it.

Research Type: Experimental
LLM Response: "Table 1 gives quantitative CLIP similarities between the embeddings of the prompts and the synthesized images. Our results are similar to or better than those of the compared methods. As described earlier, unmodified T2I methods such as Stable Diffusion are fallible, and using such a system is not simply a matter of typing a text prompt."
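
The CLIP similarity referenced above can be computed with off-the-shelf tools. The sketch below uses the Hugging Face transformers CLIP API; the checkpoint name and the clip_similarity helper are illustrative assumptions, not the authors' evaluation code.

```python
# Cosine similarity between CLIP embeddings of a prompt and a generated image.
# A minimal sketch, assuming the standard transformers CLIP API.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(image: Image.Image, prompt: str) -> float:
    inputs = processor(text=[prompt], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Normalize the projected embeddings, then take their dot product.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())
```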

Researcher Affiliation: Collaboration
LLM Response: "Wan-Duo Kurt Ma¹, Avisek Lahiri², J. P. Lewis³*, Thomas Leung², W. Bastiaan Kleijn¹,² (¹Victoria University of Wellington; ²Google Research; ³NVIDIA Research)"

Pseudocode: Yes
LLM Response: Algorithm 1, DD Cross-Attention Editing:

```
 1: procedure DDCrossAttnEdit(DM(·), P, B)
 2:     for l ∈ layers(DM(z_t, P)) do
 3:         if type(l) ∈ CrossAttn then
 4:             A := Softmax(Q_l(z_t) · K_l(P)^T)
 5:             D^(i) ← A^(i) ⊙ W(B) + S(B),  ∀ i ∈ T_I
 6:             a* := argmin_a L_a
 7:             A^(|P|+1:77) := D^(|P|+1:77)_{a*}
 8:             z_t ← l(z_t, A · V_l(P))
 9:         else
10:             z_t ← l(z_t)
```
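
To connect the pseudocode to tensors, here is a minimal PyTorch sketch of the masked edit on lines 5 and 7. It is an illustration under assumptions: the paper builds W(B) and S(B) from smooth windows over the bounding box, whereas constants stand in here, and the selection over candidate edits on line 6 is omitted.

```python
# Edit the "trailing" cross-attention maps (token positions |P|+1 .. 77):
# damp attention outside the box B and inject attention inside it.
import torch

def edit_trailing_attention(attn: torch.Tensor, box_mask: torch.Tensor,
                            prompt_len: int, strengthen: float = 1.0,
                            weaken: float = 0.0) -> torch.Tensor:
    # attn:       (heads, pixels, 77) softmaxed cross-attention maps A
    # box_mask:   (pixels,) boolean, True inside the target bounding box B
    # prompt_len: |P|, the number of real prompt tokens before the padding
    w = box_mask.float() + (~box_mask).float() * weaken  # W(B)
    s = box_mask.float() * strengthen                    # S(B)
    edited = attn.clone()
    trailing = slice(prompt_len + 1, attn.shape[-1])
    edited[..., trailing] = attn[..., trailing] * w[:, None] + s[:, None]
    return edited
```

The edited maps then replace A before the product with V_l(P), as on line 8 of the algorithm.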

Open Source Code: No
LLM Response: "The algorithm is simple to implement, requiring only a few lines of modification of a widely used library." (The footnote points to https://github.com/huggingface/diffusers.)
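
Although no code is released, the "few lines of modification" claim can be illustrated with the attention-processor hook that diffusers exposes. The sketch below follows the processor calling convention of recent diffusers versions, reuses the hypothetical edit_trailing_attention helper from above, and skips details such as resizing the box mask to every attention resolution; it is not the authors' implementation.

```python
import torch
from diffusers import StableDiffusionPipeline

class DirectedAttnProcessor:
    """Recomputes attention and applies the trailing-map edit to cross-attention."""
    def __init__(self, box_mask: torch.Tensor, prompt_len: int):
        self.box_mask = box_mask      # (pixels,) bool, True inside the box B
        self.prompt_len = prompt_len  # |P|

    def __call__(self, attn, hidden_states, encoder_hidden_states=None,
                 attention_mask=None, **kwargs):
        is_cross = encoder_hidden_states is not None
        context = encoder_hidden_states if is_cross else hidden_states
        q = attn.head_to_batch_dim(attn.to_q(hidden_states))
        k = attn.head_to_batch_dim(attn.to_k(context))
        v = attn.head_to_batch_dim(attn.to_v(context))
        probs = attn.get_attention_scores(q, k, attention_mask)
        # Edit cross-attention only, and only where the mask resolution fits.
        if is_cross and probs.shape[1] == self.box_mask.numel():
            probs = edit_trailing_attention(probs, self.box_mask, self.prompt_len)
        out = attn.batch_to_head_dim(torch.bmm(probs, v))
        return attn.to_out[1](attn.to_out[0](out))  # projection, then dropout

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
box_mask = torch.zeros(16 * 16, dtype=torch.bool)  # 16x16 attention grid
box_mask[: 16 * 8] = True                          # hypothetical box: top half
pipe.unet.set_attn_processor(DirectedAttnProcessor(box_mask, prompt_len=5))
```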

Open Datasets: No
LLM Response: The paper evaluates its method by generating images from text prompts and comparing them with other models using metrics such as CLIP similarity, but it does not specify or provide access to a publicly available dataset of prompts or images used in its experimental evaluation.

Dataset Splits: No
LLM Response: The paper does not specify any training, validation, or test dataset splits. The method operates on pre-trained models and generates images directly from prompts.

Hardware Specification: No
LLM Response: The paper does not explicitly describe the hardware (e.g., GPU/CPU models or memory) used to run its experiments; it mentions hardware only in the context of other research.

Software Dependencies: No
LLM Response: "Our method is implemented using the Python DIFFUSERS implementation of Stable Diffusion (SD) and uses the available pre-trained SD 1.5 model." (The footnote points to https://github.com/huggingface/diffusers.)
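
Since no versions are pinned, the stated stack amounts to diffusers plus the pre-trained SD 1.5 weights. A minimal load-and-generate sketch, assuming the runwayml/stable-diffusion-v1-5 checkpoint id and an illustrative prompt:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the pre-trained SD 1.5 weights through diffusers.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe("a stone castle surrounded by lakes and trees").images[0]
image.save("castle.png")
```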

Experiment Setup: Yes
LLM Response: "We chose N = 10 in most of our experiments. ... w_r ∈ ℝ is a given weight, generally set to 0.1 in our experiments. ... N is generally set at 10 in our experiments (Ma et al. 2023). Large N better preserves the original foreground and background, while smaller N encourages more interaction between the foreground and background."
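
Read as a schedule, these settings could look like the sketch below. Only N = 10 and w_r = 0.1 come from the quoted text; the 50-step count, the variable names, and the "edit during the first N steps" reading are assumptions.

```python
NUM_INFERENCE_STEPS = 50  # a typical Stable Diffusion default, not stated per experiment
N = 10                    # denoising steps with attention editing (paper: N = 10)
W_R = 0.1                 # regularization weight w_r (paper: generally 0.1)

# Editing is assumed active for the first N steps; afterwards the model
# denoises normally, letting foreground and background interact.
edit_active = [step < N for step in range(NUM_INFERENCE_STEPS)]
```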