Visual Prompting via Image Inpainting

Authors: Amir Bar, Yossi Gandelsman, Trevor Darrell, Amir Globerson, Alexei Efros

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To study visual prompting, we pretrain different models (see Section 4.1) on ImageNet and on the Figures dataset, then quantitatively evaluate the models using different prompts on simple downstream computer vision tasks (see Section 4.2). Using a synthetic dataset, we assess how the choice of model and data affects the success of visual prompting in Section 4.3, and explore different prompting design choices in Section 4.4. We provide a large variety of qualitative results both in this section and in the Supplementary Material. (A sketch of the prompt layout this protocol relies on appears after the table.)
Researcher Affiliation | Collaboration | Amir Bar (1,2), Yossi Gandelsman (1), Trevor Darrell (1), Amir Globerson (2,3), Alexei A. Efros (1); affiliations: (1) UC Berkeley, (2) Tel Aviv University, (3) Google Research
Pseudocode | No | No pseudocode or clearly labeled algorithm blocks were found.
Open Source Code | No | Project page: https://yossigandelsman.github.io/visual_prompt. The paper includes a project page URL but does not explicitly state that source code for the methodology is provided at that link or elsewhere.
Open Datasets | Yes | For simplicity, we use a fixed ImageNet-pretrained VQGAN codebook. ... We train it on ImageNet and our Figures dataset... The Computer Vision Figures (Figures) dataset consists of 88,645 images that more closely resemble the structure of our visual prompts. The dataset was collected from arXiv... We include a datasheet with more information in the Supplementary Material. ... [42] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115(3), 211-252 (2015).
Dataset Splits | Yes | We randomly partitioned 90% of the data to train and left the rest for validation. (A minimal split sketch appears after the table.)
Hardware Specification | Yes | For training, we used a machine with 8 Quadro RTX 6000 GPUs, with a batch size of 48.
Software Dependencies | No | The paper does not provide the specific software dependencies (e.g., programming language or library versions) needed for replication. It mentions using a publicly available checkpoint from https://github.com/CompVis/taming-transformers, but this is a model checkpoint, not a software dependency.
Experiment Setup | Yes | All the models we describe are large transformer-based models [52, 13], with patch size 16x16, embedding dim 1024, 24 layers, and 16 heads. For training, we used a machine with 8 Quadro RTX 6000 GPUs, with a batch size of 48. The input image size is 224x224. ... We also pretrain another model for 1000 epochs on our dataset. (A configuration sketch appears after the table.)
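
The evaluation protocol quoted in the Research Type row hinges on assembling an example input-output pair and a query image into a single canvas for the inpainting model to complete. Below is a minimal, hypothetical sketch of that layout in Python; the helper name, cell size, and padding are our assumptions (the paper only fixes the overall 224x224 input), not the authors' code.

```python
import numpy as np
from PIL import Image

def build_visual_prompt(example_in, example_out, query, cell=111, pad=2):
    """Assemble a 2x2 visual-prompt canvas (hypothetical helper): the top
    row holds the task example (input -> output), the bottom-left cell
    holds the query, and the bottom-right cell stays blank for the
    pretrained inpainting model to fill in."""
    size = 2 * cell + pad          # 224 for cell=111, pad=2
    canvas = Image.new("RGB", (size, size), "white")
    cells = [(example_in, 0, 0), (example_out, 0, 1), (query, 1, 0)]
    for img, row, col in cells:
        tile = img.resize((cell, cell))
        canvas.paste(tile, (col * (cell + pad), row * (cell + pad)))
    # Binary mask marking the bottom-right cell as the region to inpaint.
    mask = np.zeros((size, size), dtype=np.uint8)
    mask[cell + pad:, cell + pad:] = 1
    return canvas, mask
```

The masked canvas is passed through the inpainting model, and the completed bottom-right cell is read off as the prediction for the query image.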
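For the Dataset Splits row, the paper states only a random 90/10 train/validation partition. A sketch of such a split under assumed inputs (the seed and file-list representation are illustrative, not from the paper):

```python
import random

def split_files(paths, train_frac=0.9, seed=0):
    """Randomly partition a list of file paths into train/validation
    subsets. The seed and tooling are illustrative assumptions; the
    paper specifies only the 90/10 ratio."""
    paths = list(paths)
    random.Random(seed).shuffle(paths)
    cut = int(train_frac * len(paths))
    return paths[:cut], paths[cut:]
```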
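The Experiment Setup row pins down a ViT-Large-shaped backbone: 16x16 patches, embedding dim 1024, 24 layers, 16 heads, 224x224 inputs. The sketch below instantiates a transformer with those dimensions via the timm library; note this plain ViT is only a dimensional stand-in, since the paper's models are MAE/VQGAN-style inpainting models rather than classifiers.

```python
from timm.models.vision_transformer import VisionTransformer

# Transformer with the dimensions quoted in the Experiment Setup row:
# 16x16 patches, 1024-dim embeddings, 24 layers, 16 attention heads,
# 224x224 inputs. A dimensional stand-in only, not the authors' model,
# which predicts VQGAN codebook tokens for the masked regions.
backbone = VisionTransformer(
    img_size=224,
    patch_size=16,
    embed_dim=1024,
    depth=24,
    num_heads=16,
)
print(sum(p.numel() for p in backbone.parameters()))  # ~300M params, ViT-L scale
```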