BARET: Balanced Attention Based Real Image Editing Driven by Target-Text Inversion
Authors: Yuming Qiao, Fanyi Wang, Jingwen Su, Yanhao Zhang, Yunjie Yu, Siyu Wu, Guo-Jun Qi
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In order to demonstrate the editing capability, effectiveness, and efficiency of the proposed BARET, we have conducted extensive qualitative and quantitative experiments. Moreover, results from a user study and an ablation study further prove its superiority over other methods. |
| Researcher Affiliation | Collaboration | Yuming Qiao1,2, Fanyi Wang1*, Jingwen Su1, Yanhao Zhang1, Yunjie Yu1, Siyu Wu3, Guo-Jun Qi1,4 1OPPO Research Institute 2Tsinghua University 3Zhejiang University 4Westlake University |
| Pseudocode | Yes | Algorithm 1: Target-text inversion |
| Open Source Code | No | The paper does not provide an explicit statement or link for the open-source code of the described methodology. |
| Open Datasets | Yes | To this end, we refer to TEdBench (Kawar et al. 2023) and collected 100 pairs of real images and textual descriptions for editing. |
| Dataset Splits | No | The paper mentions collecting 100 pairs for a user study and describes tuning iterations, but does not provide specific train/validation/test dataset splits for model training. |
| Hardware Specification | Yes | The inversion stage of our method takes only about 16s on a single A100, which greatly improves the editing efficiency compared to methods that require fine-tuning diffusion model such as SINE and Imagic. |
| Software Dependencies | Yes | All experiments are based on stable diffusion v1.5 (Rombach et al. 2022), implemented DDIM sampling strategy with 50 steps and guidance scale 7.5. |
| Experiment Setup | Yes | All experiments are based on Stable Diffusion v1.5 (Rombach et al. 2022), using the DDIM sampling strategy with 50 steps and guidance scale 7.5. For target-text inversion, the target text embedding is fine-tuned with an MSE loss, 250 tuning iterations in total (5 iterations per step), using the Adam optimizer (Kingma and Ba 2015) with learning rate 0.001. The progressive loss threshold for early stopping is defined as {t·1e-5}_{t=1}^{T} across timesteps, so that reconstruction quality is boosted by a low loss threshold at the early stage of denoising, with the threshold gradually raised at later stages. |
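The target-text inversion loop described in the setup (Algorithm 1 in the paper) can be sketched as follows. This is a hedged, self-contained illustration only: the `toy_prediction` objective and plain gradient-descent update are stand-ins I introduce for the Stable Diffusion v1.5 UNet noise prediction and the Adam optimizer, which are not reproduced here. The hyperparameters (50 DDIM steps, 5 iterations per step, learning rate 1e-3, progressive threshold t·1e-5) follow the paper's quoted values.

```python
T = 50              # DDIM denoising steps (paper: 50 steps)
ITERS_PER_STEP = 5  # tuning iterations per timestep (paper: 5, 250 total)
LR = 1e-3           # learning rate from the paper


def progressive_threshold(t: int) -> float:
    """Early-stop loss threshold at timestep t: {t * 1e-5}_{t=1}^{T}.

    Low at early denoising stages (small t) to boost reconstruction
    quality, and gradually raised at later stages.
    """
    return t * 1e-5


def tune_embedding(embedding, target, t):
    """Fine-tune an embedding for one timestep with early stopping.

    Stand-in for one step of target-text inversion: the real method
    minimizes an MSE between the diffusion model's noise prediction and
    the DDIM trajectory; here the objective is a toy sum-of-squares
    distance ||embedding - target||^2 so the sketch is runnable.
    """
    loss = sum((e - g) ** 2 for e, g in zip(embedding, target))
    for _ in range(ITERS_PER_STEP):
        if loss < progressive_threshold(t):  # progressive early stop
            break
        # Gradient of the sum-of-squares objective: 2 * (e - g)
        grad = [2.0 * (e - g) for e, g in zip(embedding, target)]
        # Plain gradient descent stands in for Adam here
        embedding = [e - LR * g for e, g in zip(embedding, grad)]
        loss = sum((e - g) ** 2 for e, g in zip(embedding, target))
    return embedding, loss


def invert(embedding, target):
    """Run tuning over all T timesteps (at most T * ITERS_PER_STEP = 250)."""
    for t in range(T, 0, -1):  # denoising proceeds from timestep T down to 1
        embedding, loss = tune_embedding(embedding, target, t)
    return embedding, loss
```

Note that the per-timestep budget times the number of steps (5 × 50) matches the paper's 250 total iterations, and the threshold schedule makes early stopping strictest at the earliest (low-t) denoising stages.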