Guiding Instruction-based Image Editing via Multimodal Large Language Models
Authors: Tsu-Jui Fu, Wenze Hu, Xianzhi Du, William Yang Wang, Yinfei Yang, Zhe Gan
Venue: ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate various aspects of Photoshop-style modification, global photo optimization, and local editing. Extensive experimental results demonstrate that expressive instructions are crucial to instruction-based image editing, and our MGIE can lead to a notable improvement in automatic metrics and human evaluation while maintaining competitive inference efficiency. |
| Researcher Affiliation | Collaboration | Tsu-Jui Fu¹, Wenze Hu², Xianzhi Du², William Yang Wang¹, Yinfei Yang², Zhe Gan² (¹UC Santa Barbara, ²Apple) |
| Pseudocode | Yes | Algorithm 1 MLLM-Guided Image Editing (a hedged sketch follows this table) |
| Open Source Code | No | The paper mentions a "Project website: https://mllm-ie.github.io" but does not explicitly state that the source code for the methodology is available there, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | We use IPr2Pr (Brooks et al., 2023) as our pre-training data. [...] For a comprehensive evaluation, we consider various editing aspects: EVR (Tan et al., 2019), GIER (Shi et al., 2020), MA5k (Shi et al., 2022), and MagicBrush (Zhang et al., 2023a). |
| Dataset Splits | Yes | We treat the same training/validation/testing split as the original settings. |
| Hardware Specification | Yes | All experiments are conducted in PyTorch (Paszke et al., 2017) on 8 A100 GPUs. |
| Software Dependencies | No | The paper states that experiments are conducted in "PyTorch (Paszke et al., 2017)" but does not specify a version number for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | The learning rates of the MLLM and F are 5e-4 and 1e-4, respectively. All experiments are conducted in PyTorch (Paszke et al., 2017) on 8 A100 GPUs. We adopt AdamW (Loshchilov & Hutter, 2019) with the batch size of 128 to optimize MGIE. [...] During inference, we use V = 1.5 and X = 7.5. (A configuration sketch follows this table.) |
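
For context on the Pseudocode row above, the structure of the paper's Algorithm 1 (MLLM-Guided Image Editing) can be summarized as the following hedged Python sketch. All names here (`mgie_edit`, `mllm`, `edit_head`, `diffusion`) are hypothetical stand-ins: the paper does not release code, so this is a reading of the algorithm's flow, not the authors' implementation.

```python
def mgie_edit(image, instruction, mllm, edit_head, diffusion):
    """Hedged sketch of MLLM-guided image editing (Algorithm 1).

    `mllm` stands in for the multimodal LLM, `edit_head` for the
    editing head F, and `diffusion` for the latent-diffusion editor.
    """
    # 1. The MLLM derives an expressive instruction (visual-aware
    #    hidden states) from the terse user instruction and the image.
    expressive_states = mllm.generate(image, instruction)

    # 2. The edit head F maps those hidden states into conditioning
    #    features for the diffusion model.
    condition = edit_head(expressive_states)

    # 3. The diffusion editor produces the edited image, guided by
    #    the input image and the derived condition; V = 1.5 and
    #    X = 7.5 are the inference scales quoted in the paper.
    return diffusion.sample(image, condition,
                            image_guidance_scale=1.5,
                            text_guidance_scale=7.5)
```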
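The Experiment Setup row quotes per-module learning rates, the optimizer, and the batch size. A minimal PyTorch sketch of that optimizer configuration, assuming placeholder modules for the MLLM and the edit head F, might look like this:

```python
import torch

# Placeholder modules standing in for the paper's components; the
# real MLLM and edit head F are far larger. This only illustrates
# the quoted optimizer setup, not the authors' training code.
mllm = torch.nn.Linear(8, 8)
f_edit = torch.nn.Linear(8, 8)

# AdamW (Loshchilov & Hutter, 2019) with per-module learning rates,
# as quoted: 5e-4 for the MLLM and 1e-4 for F.
optimizer = torch.optim.AdamW([
    {"params": mllm.parameters(),   "lr": 5e-4},
    {"params": f_edit.parameters(), "lr": 1e-4},
])

BATCH_SIZE = 128  # quoted global batch size, trained on 8 A100 GPUs
```

Per-parameter-group learning rates are the idiomatic way to express the paper's two-rate schedule in a single optimizer; any warmup or decay schedule is unspecified in the quoted text and therefore omitted here.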