Fine-Grained Visual Prompting
Authors: Lingfeng Yang, Yueze Wang, Xiang Li, Xinlong Wang, Jian Yang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our Fine-Grained Visual Prompting (FGVP) demonstrates superior performance in zero-shot comprehension of referring expressions on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks. It outperforms prior methods by an average margin of 3.0% to 4.6%, with a maximum improvement of 12.5% on the RefCOCO+ testA subset. The part detection experiments conducted on the PACO dataset further validate the preponderance of FGVP over existing visual prompting techniques. In this section, we first evaluate individual visual prompting performance. Then, we compare FGVP with previous zero-shot methods on the referring expression comprehension and part detection tasks to show our effectiveness. |
| Researcher Affiliation | Academia | 1 Nanjing University of Science and Technology, 2 Beijing Academy of Artificial Intelligence, 3 Nankai University. {yanglfnjust, csjyang}@njust.edu.cn, {yzwang, wangxinlong}@baai.ac.cn, xiang.li.implus@nankai.edu.cn |
| Pseudocode | No | The paper describes methods in text and uses figures for illustration but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/ylingfeng/FGVP. |
| Open Datasets | Yes | We conduct the experiments on several visual datasets, i.e., RefCOCO [63], RefCOCO+ [63], RefCOCOg [39], COCO [36], and PACO [44]. |
| Dataset Splits | Yes | Table 2: Ablation study on the zero-shot performance of individual visual prompting in the validation set of COCO, PACO, RefCOCO, RefCOCO+, and RefCOCOg datasets using ground truth annotations (left) and proposals in referring expression comprehension (right), respectively. Table 4: Accuracy of the part detection with ViT-L on the validation set of each benchmark. |
| Hardware Specification | Yes | All experiments are conducted on 8 Tesla V100. Experiments are run on RefCOCO with a CLIP pre-trained ViT-L/14@336px on 8 NVIDIA A100. |
| Software Dependencies | No | The paper mentions software like CLIP, SAM, Timm, and PyTorch (via a reference) but does not provide specific version numbers for these or other key software components used in the experiments. |
| Experiment Setup | Yes | Next, we ablate the standard deviation of the Gaussian blur kernel for blur-based prompting [4] (Fig. 5), and a value of 100 achieves the best result. Notably, we set the grid size to 16 along one side of the image and used an NMS threshold of 0.7 by default. |
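
The experiment setup row above describes three defaults: a 16-point grid per image side, an NMS threshold of 0.7 for mask proposals, and a Gaussian blur standard deviation of 100 for blur-based prompting. Below is a minimal Python sketch of how these could translate into the public SAM and CLIP APIs (grid size → `points_per_side=16`, NMS → `box_nms_thresh=0.7`, blur → a PIL Gaussian blur on the background). It is an illustration of the described zero-shot pipeline, not the official FGVP implementation; helper names such as `blur_reverse_mask` and `score_expression` are hypothetical.

```python
import numpy as np
import torch
import clip
from PIL import Image, ImageFilter
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

device = "cuda" if torch.cuda.is_available() else "cpu"

# CLIP ViT-L/14@336px, matching the backbone quoted in the hardware row.
clip_model, preprocess = clip.load("ViT-L/14@336px", device=device)

# SAM proposals from a 16x16 point grid with an NMS threshold of 0.7
# (checkpoint path is a placeholder).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to(device)
mask_generator = SamAutomaticMaskGenerator(sam, points_per_side=16, box_nms_thresh=0.7)


def blur_reverse_mask(image: Image.Image, mask: np.ndarray, sigma: float = 100.0) -> Image.Image:
    """Fine-grained prompt: keep the masked region sharp, blur the rest of the image."""
    blurred = image.filter(ImageFilter.GaussianBlur(radius=sigma))
    mask_img = Image.fromarray((mask * 255).astype(np.uint8))
    return Image.composite(image, blurred, mask_img)


@torch.no_grad()
def score_expression(image_path: str, expression: str):
    """Pick the mask proposal whose blur-prompted crop best matches the expression."""
    image = Image.open(image_path).convert("RGB")
    proposals = mask_generator.generate(np.array(image))
    prompts = [blur_reverse_mask(image, p["segmentation"]) for p in proposals]

    image_feats = clip_model.encode_image(
        torch.stack([preprocess(p) for p in prompts]).to(device))
    text_feats = clip_model.encode_text(clip.tokenize([expression]).to(device))
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

    sims = (image_feats @ text_feats.T).squeeze(-1)
    best = int(sims.argmax())
    return proposals[best]["bbox"], sims[best].item()
```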