DiffEdit: Diffusion-based semantic image editing with mask guidance
Authors: Guillaume Couairon, Jakob Verbeek, Holger Schwenk, Matthieu Cord
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We quantitatively evaluate our approach and compare to prior work using images of the ImageNet and COCO datasets, as well as a set of generated images. [...] In this section, we describe our experimental setup, followed by qualitative and quantitative results. |
| Researcher Affiliation | Collaboration | Guillaume Couairon, Jakob Verbeek, Holger Schwenk (Meta AI) {gcouairon, jjverbeek, schwenk}@meta.com; Matthieu Cord (Sorbonne Université, Valeo.ai) matthieu.cord@sorbonne-universite.fr |
| Pseudocode | No | The paper describes the DIFFEDIT framework in Section 3.2 with textual steps and a high-level illustration in Figure 2, but it does not include formal pseudocode blocks or algorithms (a hedged sketch of the mask-estimation step is given below the table). |
| Open Source Code | No | The paper provides links to a third-party pre-trained model (Stable Diffusion) and an unofficial implementation of a comparison method (Cross Attention Control), but not the open-source code for the proposed DIFFEDIT methodology itself. |
| Open Datasets | Yes | Datasets. We perform experiments on three datasets. First, on ImageNet (Deng et al., 2009)... Second, we consider editing images generated by Imagen (Saharia et al., 2022b)... Third, we consider edits based on images and queries from the COCO (Lin et al., 2014) dataset... |
| Dataset Splits | Yes | We create a filtered version of this dataset, for which queries are structurally similar to the caption, i.e. where only a few words are changed, but the grammatical structure stays the same. We use the filtering criterion that the total number of words inserted/deleted/replaced must not exceed 25% of the total number of words in the original caption, resulting in a total of 272 queries out of 50k original queries. (An illustrative version of this filter appears below the table.) |
| Hardware Specification | Yes | This allows editing images in 10 seconds on a single Quadro GP100 GPU. |
| Software Dependencies | No | The paper mentions using "latent diffusion models," "Stable Diffusion," and "VQGAN latent spaces" and provides a link to a specific Stable Diffusion model. However, it does not provide specific version numbers for core software libraries or programming languages (e.g., Python, PyTorch/TensorFlow versions). |
| Experiment Setup | Yes | In our algorithm, we use Gaussian noise with strength 50%... The result is then rescaled to the range [0, 1], and binarized with a threshold, which we set to 0.5 by default. [...] We use 50 steps in DDIM sampling with a fixed schedule, and the encoding ratio parameter further decreases the number of updates used for our edits. [...] We also use classifier-free guidance (Ho & Salimans, 2022) with the recommended values: 5 on ImageNet, 7.5 for Stable Diffusion. (A sketch of the classifier-free guidance step appears below the table.) |
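
Since the paper provides no pseudocode or code release, the following is a minimal sketch of the mask-estimation step described in Section 3.2, using the defaults quoted above (50% noise strength, rescaling to [0, 1], binarization at 0.5). The `eps_model` noise predictor, the simplified noising schedule, and all tensor shapes are assumptions for illustration, not code from the authors.

```python
# Hedged sketch of DiffEdit-style mask estimation: contrast noise estimates
# obtained with the reference text and the query text, then binarize.
import torch

def estimate_edit_mask(eps_model, x0, t, ref_text, query_text,
                       n_samples=10, threshold=0.5):
    """Locate the region to edit by contrasting noise predictions for two prompts."""
    diffs = []
    for _ in range(n_samples):
        noise = torch.randn_like(x0)
        # Noise the input at strength t (0.5 by default, per the quoted setup);
        # this linear-in-t schedule is a simplification of the real DDIM schedule.
        x_t = (1 - t) ** 0.5 * x0 + t ** 0.5 * noise
        eps_ref = eps_model(x_t, t, ref_text)
        eps_query = eps_model(x_t, t, query_text)
        diffs.append((eps_ref - eps_query).abs().mean(dim=1, keepdim=True))
    diff = torch.stack(diffs).mean(dim=0)                           # average over noise samples
    diff = (diff - diff.min()) / (diff.max() - diff.min() + 1e-8)   # rescale to [0, 1]
    return (diff > threshold).float()                               # binarize at 0.5 by default
```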
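The COCO query filter ("words inserted/deleted/replaced must not exceed 25% of the words in the original caption") is stated only in prose; the script is not released. The snippet below is one plausible re-implementation, and the whitespace tokenization and `difflib`-based edit counting are assumptions.

```python
# Illustrative 25%-word-edit filter for COCO caption/query pairs.
import difflib

def keep_query(caption: str, query: str, max_ratio: float = 0.25) -> bool:
    """Keep a query only if few words differ from the original caption."""
    cap_words, qry_words = caption.split(), query.split()
    matcher = difflib.SequenceMatcher(a=cap_words, b=qry_words)
    edited = sum(max(i2 - i1, j2 - j1)
                 for tag, i1, i2, j1, j2 in matcher.get_opcodes()
                 if tag != "equal")      # count inserted/deleted/replaced words
    return edited <= max_ratio * len(cap_words)
```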
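For completeness, the classifier-free guidance step mentioned in the setup (guidance scale 5 on ImageNet, 7.5 with Stable Diffusion) combines conditional and unconditional noise estimates as in Ho & Salimans (2022). The `eps_model` callable and the empty-string null prompt below are assumptions about the interface, not the paper's code.

```python
# Sketch of a classifier-free guidance step at a single denoising iteration.
def guided_noise(eps_model, x_t, t, text, guidance_scale=7.5):
    """Blend unconditional and text-conditional noise estimates."""
    eps_cond = eps_model(x_t, t, text)
    eps_uncond = eps_model(x_t, t, "")   # unconditional / null prompt
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```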