Multi-Region Text-Driven Manipulation of Diffusion Imagery
Authors: Yiming Li, Peng Zhou, Jun Sun, Yi Xu
AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate our method, we conducted a comparative analysis of the three aforementioned manipulation tasks on our constructed dataset. Considering that Prompt-to-Prompt is not directly applicable to the task, we re-implemented and applied it to multi-region generation through independent control of multiple regions, denoted as P2P*. During experimentation, we maintained consistent random seeds across different methods, ensuring comparable outcomes. Qualitative Comparison. For different methods, the fixed random seed introduced the same input latent noise, leading to similar layouts and structures in their outcomes. MultiDiffusion results in significant structural alterations to the inherent regions during manipulation. In some instances, it even leads to notable distortions, such as unrealistic limb deformations as indicated by the yellow box in Fig. 4. It lacks the capacity to preserve inherent structures, rendering it vulnerable to regional interference. P2P* exhibits considerably better structural preservation compared to MultiDiffusion, including spatial layout and texture patterns. However, the observed discrepancies emphasize our approach's notable superiority in image coherence and preservation of the inherent components. In contrast, MRGD exhibits region awareness through selection and guidance, which enables the model to distinguish between edited and inherent regions, thereby suppressing mutual interference between different regions. On one hand, our method preserves better identity consistency of inherent objects, as indicated by the red box in Fig. 4. On the other hand, our approach produces more harmonious results at region boundaries, as shown in the blue box. Quantitative Comparison. In Tab. 1, we observed a marginal difference among the three methods in CLIP similarity, with even higher scores for the comparison methods. We posit that this phenomenon highlights a bias in multi-region diffusion, wherein it excessively prioritizes image-text alignment while neglecting inter-region interactions. Conversely, our approach outperforms the comparative methods in both SSIM-e, which reflects the coherence of manipulation, and SSIM-i, which gauges the preservation of inherent objects. Quantitative results underscore the high quality and precision of MRGD. Subsequent analyses will be conducted in conjunction with specific tasks. (A hedged sketch of the CLIP-similarity and masked-SSIM metrics appears after this table.) |
| Researcher Affiliation | Collaboration | Yiming Li1,2, Peng Zhou3, Jun Sun1, Yi Xu1,2* 1Shanghai Key Lab of Digital Media Processing and Transmission, Shanghai Jiao Tong University 2MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University 3China Mobile (Suzhou) Software Technology Co., Ltd, China {Yiming.Li, junsun, xuyi}@sjtu.edu.cn, zhoupengcv@outlook.com |
| Pseudocode | Yes | Algorithm 1: Multi-Region Guided Diffusion. Input: a source input R^S, a target input R^T; optional for real-world inversion: a real-world image I^R. Output: a source image I^S, a target image I^T. 1: z^S_T ~ N(0, 1), a unit Gaussian random distribution; 2: if inversion then; 3: for t = 0, 1, ..., T-1 do; 4: z^{R,i}_{t+1} = ε(z^{R,i}_t) for each region i; 5: end for; 6: for t = T, T-1, ..., 1 do; 7: null-text optimization for the latent z̄^R_t and each region i; 8: min ‖z^{R,i}_{t-1} − z̄^{R,i}_{t-1}(z̄^{R,i}_t, ∅^i_t, P^i)‖²₂; 9: end for; 10: z^S_T ← z̄^R_T; 11: end if; 12: z^T_T ← z^S_T; 13: for t = T, T-1, ..., 1 do; 14: Â^T ← EDIT(A^S, A^T); 15: if inversion then; 16: ∅_t ← ∅̄_t; 17: end if; 18: ẑ^{S,fi}_{t-1}, ẑ^{T,fi}_{t-1} ← ε_θ(z^{S,i}_t, z^{T,i}_t | ∅^i_t, P^i) for each region i; 19: L = λ_rs·L_RS + λ_is·L_IS + λ_ip·L_IP; 20: z^{S,fi}_{t-1}, z^{T,fi}_{t-1} ~ N(µ + Σ·∇_{ẑ_{t-1}}L, Σ); 21: end for; 22: I^S, I^T = D(ẑ^{S,fi}_0, ẑ^{T,fi}_0); 23: return I^S, I^T. (A hedged PyTorch sketch of the guided sampling step appears after this table.) |
| Open Source Code | Yes | Code is available at https://github.com/liyiming09/multiregion-guided-diffusion. |
| Open Datasets | No | To the best of our knowledge, there exist no standardized benchmarks for this challenging task of text-guided image manipulation. Thus, we utilized open-source models from the Stable Diffusion WebUI community to conduct image manipulation, with a focus on a wide range of subjects including humans, vehicles, and animals. More specifically, we sourced a variety of models and prompts from the community, which we then integrated with manually designated region coordinates (x, y, h, w) to generate a collection of 61 input pairs (R^S, R^T) for ensuing experimentation. The paper describes creating a dataset but does not provide concrete access information for this specific dataset. |
| Dataset Splits | No | The paper states it uses pre-trained models and evaluates on a constructed dataset of 61 input pairs, but does not specify any training, validation, or test splits for these pairs, as it's an evaluation of the manipulation framework rather than a model trained from scratch on this dataset. |
| Hardware Specification | Yes | All experiments were executed on one RTX 3090 GPU with PyTorch. |
| Software Dependencies | No | All experiments were executed on one RTX 3090 GPU with PyTorch. The paper mentions PyTorch but does not provide a specific version number, which is required for reproducibility. |
| Experiment Setup | Yes | Details. For the guidance model, we utilized CLIP ViT-B/16 (Radford et al. 2021). All experiments were executed on one RTX 3090 GPU with PyTorch. Additionally, we set the default parameters to λ_rs = 1000, λ_is = 2000, λ_ip = 300, τ = 0.5. To ensure result quality and parameter consistency, we employed a DDIM solver with T = 20 diffusion steps in all experiments. An introduced hyperparameter, denoted as T_total, governs the incorporation of r_total into the generation process after T_total steps, serving to avert unexpected disturbances to the initial layout. The more detailed settings are reported with analysis in the appendix. (A configuration sketch follows this table.) |
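
The Research Type row reports CLIP similarity alongside SSIM-e and SSIM-i. The sketch below shows, under stated assumptions, how such metrics could be computed: CLIP image-text similarity via the Hugging Face `transformers` CLIP ViT-B/16 checkpoint, and region-restricted SSIM via `torchmetrics` with a binary mask. How the paper crops or masks regions is not specified here, so `masked_ssim` is an illustrative stand-in rather than the authors' evaluation code.

```python
# Hedged sketch of the Tab. 1 metrics (not the authors' evaluation code).
# Assumptions: CLIP ViT-B/16 via Hugging Face transformers, SSIM via torchmetrics,
# and a simple binary-mask restriction for SSIM-e / SSIM-i.
import torch
from transformers import CLIPModel, CLIPProcessor
from torchmetrics.functional import structural_similarity_index_measure as ssim

def clip_similarity(image, prompt, model, processor):
    """Cosine similarity between CLIP embeddings of an image (crop) and a text prompt."""
    inputs = processor(text=[prompt], images=[image], return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb * txt_emb).sum(dim=-1).item()

def masked_ssim(edited, source, mask):
    """SSIM between edited and source images (B, C, H, W), restricted by a binary mask.

    With an edited-region mask this plays the role of SSIM-e; with its complement, SSIM-i.
    Zeroing out the complement before SSIM is one simple choice; the paper may crop instead.
    """
    return ssim(edited * mask, source * mask).item()

# Usage (downloads weights):
# model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
# processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")
```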
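
The Pseudocode row gives Algorithm 1 in outline form. Below is a minimal PyTorch sketch of one guided denoising step corresponding to lines 18-20: region-aware losses are combined as L = λ_rs·L_RS + λ_is·L_IS + λ_ip·L_IP, and their gradient shifts the sampling mean, z_{t-1} ~ N(µ + Σ·∇L, Σ). The `eps_model` and `region_losses` callables, the fixed alpha/sigma values, and the sign convention are placeholders and assumptions; this is not the released implementation linked above.

```python
# Hedged sketch of one guided denoising step from Algorithm 1 (lines 18-20), not the
# released code. eps_model and region_losses are placeholder callables; alpha_bar_t,
# alpha_bar_prev and sigma_t are fixed illustrative values instead of a real schedule.
import torch

def guided_step(z_t, t, eps_model, region_losses,
                lambdas=(1000.0, 2000.0, 300.0),      # (lambda_rs, lambda_is, lambda_ip)
                alpha_bar_t=0.9, alpha_bar_prev=0.95, sigma_t=0.05):
    """One region-guided DDIM step on a latent z_t of shape (B, C, H, W)."""
    z_t = z_t.detach().requires_grad_(True)

    eps = eps_model(z_t, t)                            # predicted noise eps_theta
    # DDIM estimate of x0 and the pre-guidance mean mu
    x0_hat = (z_t - (1 - alpha_bar_t) ** 0.5 * eps) / alpha_bar_t ** 0.5
    mu = alpha_bar_prev ** 0.5 * x0_hat + (1 - alpha_bar_prev - sigma_t ** 2) ** 0.5 * eps

    # L = lambda_rs * L_RS + lambda_is * L_IS + lambda_ip * L_IP (Algorithm 1, line 19)
    l_rs, l_is, l_ip = region_losses(z_t, x0_hat)
    loss = lambdas[0] * l_rs + lambdas[1] * l_is + lambdas[2] * l_ip

    # z_{t-1} ~ N(mu + Sigma * grad L, Sigma), following the sign convention of line 20
    grad = torch.autograd.grad(loss, z_t)[0]
    z_prev = mu + sigma_t ** 2 * grad + sigma_t * torch.randn_like(z_t)
    return z_prev.detach()

if __name__ == "__main__":
    # Dummy stand-ins so the sketch runs end to end.
    eps_model = lambda z, t: torch.zeros_like(z)
    region_losses = lambda z, x0: (z.square().mean(), x0.square().mean(), (z - x0).abs().mean())
    z = torch.randn(1, 4, 64, 64)
    for step in reversed(range(20)):                   # T = 20 DDIM steps, as in the paper
        z = guided_step(z, step, eps_model, region_losses)
    print(z.shape)
```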
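
The Experiment Setup row lists the default hyperparameters (λ_rs = 1000, λ_is = 2000, λ_ip = 300, τ = 0.5), CLIP ViT-B/16 guidance, and a 20-step DDIM schedule. A minimal configuration sketch collecting these reported values follows; the field names and the T_total default are assumptions, since the detailed settings are only reported in the paper's appendix.

```python
# Hedged configuration sketch of the reported setup; field names (and the t_total
# default) are assumptions, since the detailed settings live in the paper's appendix.
from dataclasses import dataclass

@dataclass
class MRGDConfig:
    clip_model: str = "ViT-B/16"    # guidance backbone (Radford et al. 2021)
    num_ddim_steps: int = 20        # diffusion steps T with a DDIM solver
    lambda_rs: float = 1000.0       # guidance weight lambda_rs
    lambda_is: float = 2000.0       # guidance weight lambda_is
    lambda_ip: float = 300.0        # guidance weight lambda_ip
    tau: float = 0.5                # threshold tau
    t_total: int = 10               # step after which r_total is incorporated (value assumed)

if __name__ == "__main__":
    print(MRGDConfig())
```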