Visual Instruction Inversion: Image Editing via Image Prompting
Authors: Thao Nguyen, Yuheng Li, Utkarsh Ojha, Yong Jae Lee
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare our approach against both image-editing and visual prompting frameworks, on both synthetic and real images. In Section 4.2, we present qualitative results, followed by a quantitative comparison in Section 4.3. Both quantitative and qualitative results demonstrate that our approach not only achieves competitive performance to state-of-the-art models, but also has additional merits in specific cases. |
| Researcher Affiliation | Academia | Thao Nguyen, Yuheng Li, Utkarsh Ojha, Yong Jae Lee (University of Wisconsin-Madison) |
| Pseudocode | Yes | Algorithm 1 Visual Instruction Inversion (VISII) |
| Open Source Code | No | The paper provides a project webpage (https://thaoshibe.github.io/visii/) but no explicit statement about open-sourcing the code or a direct link to a code repository. |
| Open Datasets | Yes | We randomly sampled images from the Clean-InstructPix2Pix dataset [4], which consists of synthetic paired before-after images with corresponding descriptions. |
| Dataset Splits | No | The paper mentions total image pair counts used for evaluation but does not specify explicit training, validation, and test splits with percentages or sample counts for reproducibility. |
| Hardware Specification | Yes | All experiments are conducted on a 4 NVIDIA RTX 3090 machine. |
| Software Dependencies | No | The paper mentions software components like 'pretrained clip-vit-large-patch14' and 'InstructPix2Pix' but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We use the frozen pretrained InstructPix2Pix [4] to optimize the instruction c_T for N = 1000 steps, T = 1000 timesteps. We use AdamW optimizer [25] with learning rate γ = 0.001, λ_mse = 4, and λ_clip = 0.1. Text guidance and image guidance scores are set at their default values of 7.5 and 1.5, respectively. |
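
The hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration object, which may help when trying to reproduce the reported setup. The following is a minimal sketch, not the authors' code: the class name `VisiiSetup`, its field names, and the helper `combined_loss` are hypothetical, and the weighted-sum form of the loss is an assumption, since the row above reports only the two weights λ_mse and λ_clip.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VisiiSetup:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""

    num_optimization_steps: int = 1000   # N
    num_diffusion_timesteps: int = 1000  # T
    learning_rate: float = 0.001         # gamma, reported with the AdamW optimizer
    lambda_mse: float = 4.0              # weight on the MSE term
    lambda_clip: float = 0.1             # weight on the CLIP term
    text_guidance_scale: float = 7.5     # reported default
    image_guidance_scale: float = 1.5    # reported default


def combined_loss(mse_loss: float, clip_loss: float, setup: VisiiSetup) -> float:
    # Assumption: the two terms are combined as a weighted sum.
    return setup.lambda_mse * mse_loss + setup.lambda_clip * clip_loss


setup = VisiiSetup()
# Illustrative inputs only; these are not values from the paper.
example = combined_loss(mse_loss=0.5, clip_loss=2.0, setup=setup)
print(setup)
print(f"combined loss for the illustrative inputs: {example:.3f}")
```

Freezing the dataclass keeps the reported values from being changed accidentally during a run; a variation can be created explicitly with `dataclasses.replace(setup, learning_rate=...)`.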