Prompt-to-Prompt Image Editing with Cross-Attention Control

Authors: Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, Daniel Cohen-Or

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present our results over diverse images and prompts with different text-to-image models, demonstrating high-quality synthesis and fidelity to the edited prompts. We present a user study in table 1. We provide additional measures in the appendix (table 2) to further validate our claims. We evaluate text-image correspondence using their CLIP score, demonstrating competitive results to methods that directly optimize this metric.
Researcher Affiliation | Collaboration | Amir Hertz (1,2), Ron Mokady (1,2), Jay Tenenbaum (1), Kfir Aberman (1), Yael Pritch (1), and Daniel Cohen-Or (1,2); (1) Google Research, (2) The Blavatnik School of Computer Science, Tel Aviv University
Pseudocode | Yes | Algorithm 1: Prompt-to-Prompt image editing (see the first sketch after this table)
Open Source Code | No | We also demonstrate that our method operates with different text-to-image models as a backbone and we will publish our code for the public models upon acceptance.
Open Datasets | No | The paper mentions that the large-scale language-image models it uses (e.g., Imagen) are 'trained on extremely large language-image datasets', but it does not provide concrete access information (link, DOI, specific citation) for a public dataset *used in its own experiments* for training or evaluation. The evaluation data itself is generated using 'predefined text templates'.
Dataset Splits | No | The paper describes how large language-image models are trained and internally validated, but it does not specify explicit train/validation/test dataset splits for the experiments conducted in *this* paper, as its method primarily involves inference-time control rather than training a new model.
Hardware Specification | No | The paper mentions using Imagen, Latent Diffusion, and Stable Diffusion models but does not provide specific hardware details (e.g., GPU models, CPU types, memory amounts) used to run the experiments described in the paper.
Software Dependencies | No | The paper mentions various models and components such as Imagen, Latent Diffusion, Stable Diffusion, CLIP, the T5 language model, DDIM, DDPM, U-Net, and VQGAN, but it does not specify particular version numbers for these or for any programming languages or libraries (e.g., Python, PyTorch).
Experiment Setup | Yes | Our general algorithm for controlled generation consists of performing the iterative diffusion process for both prompts simultaneously, where an attention-based manipulation is applied in each step according to the desired editing task. We fix the internal randomness, since even for the same prompt two random seeds produce drastically different outputs. To evaluate our method, we first randomly generate text-based editing examples from predefined text templates (see appendix F for more details). The source text is then fed to the Imagen model to obtain the source image. We compare our results to other text-guided editing methods: (1) VQGAN+CLIP (Crowson, 2021), (2) Text2Live (Bar-Tal et al., 2022), (3) Blended Diffusion (Avrahami et al., 2022b), and (4) GLIDE (Nichol et al., 2021). We also consider (5) a baseline approach in which we only replace the source prompt with the target prompt after 20% of the diffusion steps, using the same random seed. We use the Imagen model (Saharia et al., 2022b) as a backbone for most of our experiments and results. To calculate the mask at step t, we compute the average attention map M̄_{t,w} (averaged over steps T, ..., t) of the original word w and the map M̄*_{t,w*} of the new word w*. We then apply a threshold to produce binary maps, where B(x) := x > k and k = 0.3 throughout all our experiments. To support geometry modifications of the object, the edited region should include the silhouettes of both the original and the newly edited object; therefore, our final mask α is a union of the binary maps. Lastly, we use the mask to constrain the editing region (line 13), where ⊙ denotes element-wise multiplication (see the second sketch after this table). 32 participants answered our user study. Each was asked to evaluate 18 randomly selected Prompt-to-Prompt examples for each method. The examples were given in random order and were divided into three parts: (A) 6 replacement examples using templates 1 and 2 (see appendix F); (B) 6 local refinement examples using templates 3 and 4; and (C) 6 global refinement examples using templates 5 and 6.
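First sketch. The Pseudocode row refers to Algorithm 1, which runs the diffusion process for the source and target prompts in parallel from the same seed and injects the source cross-attention maps into the target branch at each step. Below is a minimal Python sketch of that control loop, assuming a diffusion backbone exposed through three hypothetical callables (diffusion_step, diffusion_step_with_attn, edit_attention); these names, the latent shape, and the step count are placeholders, not the authors' released code.

```python
import torch

def prompt_to_prompt_sample(
    diffusion_step,            # hypothetical: (z_t, prompt, t) -> (z_prev, attn_maps)
    diffusion_step_with_attn,  # hypothetical: (z_t, prompt, t, attn_maps) -> z_prev
    edit_attention,            # hypothetical: (attn, attn_star, t) -> edited attn maps
    source_prompt: str,
    target_prompt: str,
    num_steps: int = 50,
    seed: int = 0,
    latent_shape=(1, 4, 64, 64),
):
    # Shared initial noise: both branches start from the same seed so that the
    # unedited content is preserved (the "fixed internal randomness" in the setup).
    z = torch.randn(latent_shape, generator=torch.Generator().manual_seed(seed))
    z_star = z.clone()

    for t in reversed(range(num_steps)):
        # Source branch: an ordinary denoising step; keep its cross-attention maps.
        z, attn = diffusion_step(z, source_prompt, t)
        # Target branch: compute its own attention maps for this step.
        _, attn_star = diffusion_step(z_star, target_prompt, t)
        # Edit() combines the two sets of maps according to the task
        # (word swap, prompt refinement, or attention re-weighting).
        attn_edit = edit_attention(attn, attn_star, t)
        # Re-run the target step with the edited attention maps injected.
        z_star = diffusion_step_with_attn(z_star, target_prompt, t, attn_edit)

    return z, z_star  # latents of the source image and the edited image
```

Fixing the generator seed corresponds to the fixed internal randomness mentioned in the setup; without it, the two branches would diverge even before any attention injection.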
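Second sketch. The Experiment Setup row describes how the local-editing mask is formed: average the cross-attention maps of the original word w and the new word w* over the steps seen so far, threshold at k = 0.3, take the union of the binary maps, and use the result to blend the two latents element-wise (line 13 of Algorithm 1). The sketch below illustrates that computation under assumed tensor shapes; the max-rescaling before thresholding and the nearest-neighbor upsampling of the mask to the latent resolution are assumptions, not details stated in the quoted text.

```python
import torch
import torch.nn.functional as F

def local_edit_mask(maps_w, maps_w_star, k: float = 0.3):
    """maps_*: list of per-step (H, W) cross-attention maps for a single word."""
    m_bar = torch.stack(maps_w).mean(dim=0)             # average over steps T, ..., t
    m_bar_star = torch.stack(maps_w_star).mean(dim=0)
    # B(x) := x > k; rescaling by the max is an assumption made here so that
    # k is comparable across words and steps.
    b = (m_bar / m_bar.max()) > k
    b_star = (m_bar_star / m_bar_star.max()) > k
    return (b | b_star).float()                         # alpha: union of the binary maps

def blend_latents(z_src, z_edit, alpha):
    """Constrain the edit to the masked region: alpha ⊙ z_edit + (1 - alpha) ⊙ z_src."""
    # Upsample the (H, W) mask to the latent's spatial size (assumed layout B, C, H, W).
    alpha = F.interpolate(alpha[None, None], size=z_src.shape[-2:], mode="nearest")
    return alpha * z_edit + (1.0 - alpha) * z_src
```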