Fine-Grained Visual Prompting

Authors: Lingfeng Yang, Yueze Wang, Xiang Li, Xinlong Wang, Jian Yang

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our Fine-Grained Visual Prompting (FGVP) demonstrates superior performance in zero-shot comprehension of referring expressions on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks. It outperforms prior methods by an average margin of 3.0% to 4.6%, with a maximum improvement of 12.5% on the RefCOCO+ testA subset. The part detection experiments conducted on the PACO dataset further validate the preponderance of FGVP over existing visual prompting techniques. In this section, we first evaluate individual visual prompting performance. Then, we compare FGVP with previous zero-shot methods on the referring expression comprehension and part detection tasks to show our effectiveness. (The zero-shot scoring protocol is illustrated in the first sketch below the table.)
Researcher Affiliation | Academia | 1) Nanjing University of Science and Technology, 2) Beijing Academy of Artificial Intelligence, 3) Nankai University. {yanglfnjust, csjyang}@njust.edu.cn, {yzwang, wangxinlong}@baai.ac.cn, xiang.li.implus@nankai.edu.cn
Pseudocode | No | The paper describes methods in text and uses figures for illustration but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/ylingfeng/FGVP.
Open Datasets | Yes | We conduct the experiments on several visual datasets, i.e., RefCOCO [63], RefCOCO+ [63], RefCOCOg [39], COCO [36], and PACO [44].
Dataset Splits | Yes | Table 2: Ablation study on the zero-shot performance of individual visual prompting on the validation sets of the COCO, PACO, RefCOCO, RefCOCO+, and RefCOCOg datasets using ground truth annotations (left) and proposals in referring expression comprehension (right), respectively. Table 4: Accuracy of part detection with ViT-L on the validation set of each benchmark.
Hardware Specification | Yes | All experiments are conducted on 8 Tesla V100 GPUs. Experiments are run on RefCOCO with a CLIP pre-trained ViT-L/14@336px on 8 NVIDIA A100 GPUs.
Software Dependencies | No | The paper mentions software such as CLIP, SAM, Timm, and PyTorch (via a reference) but does not provide specific version numbers for these or other key software components used in the experiments.
Experiment Setup | Yes | Next, we ablate the standard deviation of the Gaussian blur kernel for blur-based prompting [4] (Fig. 5), and a value of 100 achieves the best result. Notably, we set the grid size to 16 along one side of the image and use an NMS threshold of 0.7 by default. (The blur-based prompt is illustrated in the second sketch below.)
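
For readers reproducing the zero-shot referring expression comprehension results summarized in the Research Type row, the following minimal sketch shows the typical CLIP-based scoring loop: each candidate region is rendered as a visually prompted image, and the region whose embedding best matches the expression is selected. This is an illustrative sketch, not the authors' released implementation; the OpenAI clip package is assumed, and apply_visual_prompt is a hypothetical placeholder for whichever prompting strategy (box, circle, blur reverse mask, etc.) is under evaluation.

```python
# Illustrative sketch only (not the authors' released code): zero-shot referring
# expression comprehension by ranking visually prompted candidate regions with CLIP.
# `apply_visual_prompt` is a hypothetical callable implementing the prompting strategy.
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14@336px", device=device)

def select_region(image: Image.Image, candidates, expression: str, apply_visual_prompt):
    """Return the index of the candidate region that best matches `expression`."""
    prompted = [preprocess(apply_visual_prompt(image, cand)) for cand in candidates]
    image_batch = torch.stack(prompted).to(device)
    text = clip.tokenize([expression]).to(device)
    with torch.no_grad():
        image_feats = model.encode_image(image_batch)
        text_feats = model.encode_text(text)
        image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
        text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
        scores = (image_feats @ text_feats.T).squeeze(-1)  # one similarity per candidate
    return int(scores.argmax().item())
```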
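
The blur-based prompt ablated in the Experiment Setup row can be sketched as follows, assuming Pillow and NumPy; this is a hedged illustration rather than the released code. Pixels outside the target mask are blurred with a Gaussian whose radius approximates the standard deviation of 100, keeping the referred region sharp before the image is passed to CLIP.

```python
# Hedged illustration of blur-based visual prompting: blur everything outside the
# target mask with a strong Gaussian (radius ~ std 100, as ablated in the paper),
# leaving the referred region sharp.
import numpy as np
from PIL import Image, ImageFilter

def blur_reverse_mask(image: Image.Image, mask: np.ndarray, sigma: float = 100.0) -> Image.Image:
    """`mask` is an HxW boolean array aligned with `image`; True marks the target region."""
    blurred = image.filter(ImageFilter.GaussianBlur(radius=sigma))
    mask_img = Image.fromarray(mask.astype(np.uint8) * 255, mode="L")
    # Keep original pixels where the mask is set, blurred pixels elsewhere.
    return Image.composite(image, blurred, mask_img)
```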