ImageBrush: Learning Visual In-Context Instructions for Exemplar-Based Image Manipulation
Authors: Yasheng Sun, Yifan Yang, Houwen Peng, Yifei Shen, Yuqing Yang, Han Hu, Lili Qiu, Hideki Koike
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 Experiments 4.1 Experimental Settings Datasets. Our work leverages four widely used in-the-wild video datasets Scannet [12], LRW [10], UBCFashion [69], and DAVIS [41] as well as a synthetic dataset that involves numerous image editing operations. ... Evaluation Metrics. In the image manipulation task, it is crucial for the edited image to align with the intended direction while preserving the instruction-invariant elements in their original form. To assess the degree of agreement between the edited image and the provided instructions, we utilize a cosine similarity metric referred to as CLIP Direction Similarity in the CLIP space. ... Ablation Study. We performed ablation studies on three crucial components of our method, namely the diffusion process for context exploitation, the vision prompt design, and the injection of human-interest area. Specifically, we conducted experiments on our model by (1) replacing the diffusion process with masked image modeling, (2) removing the cross-attention feature injection obtained from vision prompt module, and (3) deactivating the input of the human interest region. *(a hedged sketch of the CLIP Direction Similarity metric appears after this table)* |
| Researcher Affiliation | Collaboration | Yasheng Sun (Tokyo Institute of Technology, sun.y.aj@m.titech.ac.jp); Yifan Yang* (Microsoft, yifanyang@microsoft.com); Houwen Peng (Microsoft, houwen.peng@microsoft.com); Yifei Shen (Microsoft, yifeishen@microsoft.com); Yuqing Yang (Microsoft, yuqing.yang@microsoft.com); Han Hu (Microsoft, hanhu@microsoft.com); Lili Qiu (Microsoft, liliqiu@microsoft.com); Hideki Koike (Tokyo Institute of Technology, koike@c.titech.ac.jp) |
| Pseudocode | No | No pseudocode or clearly labeled algorithm blocks were found in the paper. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | Datasets. Our work leverages four widely used in-the-wild video datasets Scannet [12], LRW [10], UBCFashion [69], and DAVIS [41] as well as a synthetic dataset that involves numerous image editing operations. The Scannet [12] dataset ... The LRW [10] dataset ... The UBC-Fashion [69] dataset ... The DAVIS [41] (Densely Annotated VIdeo Segmentation) dataset ... |
| Dataset Splits | Yes | The Scannet [12] dataset ... of which 1,201 scenes lie in training split, 312 scenes are in the validation set. ... The LRW [10] dataset ... We adopt 80 percent of their test videos for training and 20 percent for evaluation. ... The DAVIS [41] ... It comprises a total of 150 videos, of which 90 are densely annotated for training and 60 for validation. |
| Hardware Specification | Yes | Our implementation utilizes PyTorch [39] and is trained on 24 Tesla V100-32G GPUs for 14K iterations using the AdamW [26] optimizer. |
| Software Dependencies | No | Our implementation utilizes PyTorch [39] and is trained on 24 Tesla V100-32G GPUs for 14K iterations using the AdamW [26] optimizer. ... During the training phase, we employ Grounding DINO [31] to label the focused region based on textual instruction. ... we extract keypoints from the UBC-Fashion dataset using OpenPose [8]... (Software names are mentioned, but specific version numbers are missing for PyTorch, AdamW, Grounding DINO, and OpenPose.) *(a hedged Grounding DINO usage sketch appears after the table)* |
| Experiment Setup | Yes | Implementation Details. In our approach, all input images have a size of 256×256 pixels and are concatenated as input to the UNet. ... During training, we set the classifier-free scale for the encoded instruction to 7.5 and the dropout ratio to 0.05. Our implementation utilizes PyTorch [39] and is trained on 24 Tesla V100-32G GPUs for 14K iterations using the AdamW [26] optimizer. The learning rate is set to 1e-6, and the batch size is set to 288. *(a hedged sketch of this training setup appears after the table)* |
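
The paper names CLIP Direction Similarity as its alignment metric but does not spell out the implementation. Below is a minimal sketch, assuming the common definition (as in StyleGAN-NADA and InstructPix2Pix): cosine similarity between the shift of the image embedding and the shift of the caption embedding in CLIP space. The checkpoint choice and the helper `clip_direction_similarity` are assumptions, not details from the paper.

```python
# Hedged sketch of CLIP Direction Similarity under the common definition:
# cosine similarity between the image-embedding shift and the
# caption-embedding shift. Checkpoint and captions are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def clip_direction_similarity(src_img: Image.Image, edited_img: Image.Image,
                              src_caption: str, tgt_caption: str) -> float:
    img_inputs = processor(images=[src_img, edited_img], return_tensors="pt")
    txt_inputs = processor(text=[src_caption, tgt_caption],
                           return_tensors="pt", padding=True)
    img_emb = model.get_image_features(**img_inputs)
    txt_emb = model.get_text_features(**txt_inputs)
    img_dir = img_emb[1] - img_emb[0]   # how the image moved in CLIP space
    txt_dir = txt_emb[1] - txt_emb[0]   # how the caption moved in CLIP space
    return torch.nn.functional.cosine_similarity(img_dir, txt_dir, dim=0).item()
```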
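
The paper states that Grounding DINO labels the focused region from the textual instruction, without further detail. The sketch below follows the usage pattern in the Grounding DINO project README (https://github.com/IDEA-Research/GroundingDINO); the config/checkpoint paths, caption, input filename, and thresholds are placeholders, not values from the paper.

```python
# Hedged sketch of text-grounded region labeling with Grounding DINO.
# All paths, the caption, and the thresholds are hypothetical.
from groundingdino.util.inference import load_model, load_image, predict

model = load_model("groundingdino/config/GroundingDINO_SwinT_OGC.py",
                   "weights/groundingdino_swint_ogc.pth")
image_source, image = load_image("frame.jpg")  # hypothetical input frame

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption="the person's jacket",  # textual instruction naming the region
    box_threshold=0.35,
    text_threshold=0.25,
)
# `boxes` (normalized cxcywh) could then mark the human-interest area.
```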
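
Finally, a minimal sketch of the reported training setup. Only the hyperparameter values (256×256 inputs, AdamW, learning rate 1e-6, batch size 288, 14K iterations, 0.05 condition dropout, guidance scale 7.5) come from the paper; the dropout mechanics and helper names are assumptions about a standard classifier-free-guidance diffusion training loop.

```python
# Hedged sketch: hyperparameters from the paper, mechanics assumed to be a
# standard classifier-free-guidance (CFG) setup for conditional diffusion.
import torch

LR, BATCH, ITERS = 1e-6, 288, 14_000
COND_DROPOUT = 0.05      # prob. of replacing the instruction embedding
GUIDANCE_SCALE = 7.5     # applied at sampling time, not during training

def drop_condition(cond_emb: torch.Tensor, null_emb: torch.Tensor,
                   p: float = COND_DROPOUT) -> torch.Tensor:
    """Randomly swap per-sample condition embeddings for a null embedding
    so the model also learns the unconditional score."""
    keep = torch.rand(cond_emb.shape[0], 1, 1, device=cond_emb.device) > p
    return torch.where(keep, cond_emb, null_emb)

def cfg_noise_pred(eps_uncond: torch.Tensor, eps_cond: torch.Tensor,
                   scale: float = GUIDANCE_SCALE) -> torch.Tensor:
    """Classifier-free guidance combination used when sampling."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Example: batch of 4 instruction embeddings of shape [4, 77, 768]
cond = torch.randn(4, 77, 768)
null = torch.zeros(77, 768)  # stand-in for a learned null embedding
cond = drop_condition(cond, null)
```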