GroundVLP: Harnessing Zero-Shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection
Authors: Haozhan Shen, Tiancheng Zhao, Mingwei Zhu, Jianwei Yin
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct main experiments on RefCOCO/+/g datasets for REC and Flickr30k Entities dataset for phrase grounding. |
| Researcher Affiliation | Academia | Haozhan Shen¹, Tiancheng Zhao²*, Mingwei Zhu¹, Jianwei Yin¹; ¹Zhejiang University, ²Binjiang Institute of Zhejiang University; {hzshen, zhumw, zjuyjw}@zju.edu.cn, tianchez@zju-bj.com |
| Pseudocode | No | No explicit pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | Our code is available at https://github.com/om-ai-lab/GroundVLP. |
| Open Datasets | Yes | We adopt three widely used datasets: RefCOCO, RefCOCO+ (Yu et al. 2016) and RefCOCOg (Mao et al. 2016). RefCOCO and RefCOCO+ are both split into validation, test A, and test B sets... We adopt Flickr30k Entities dataset (Plummer et al. 2015) for the task... |
| Dataset Splits | Yes | RefCOCO and RefCOCO+ are both split into validation, test A, and test B sets, where test A generally contains queries with persons as referring targets and test B contains other types. |
| Hardware Specification | No | The paper does not provide specific details on the hardware used for experiments (e.g., GPU/CPU models, memory, or cloud instance types). |
| Software Dependencies | No | The paper mentions software tools and models like 'Stanza', 'CLIP', 'Detic', 'VinVL', and 'ALBEF', but does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | For ALBEF, we use the 3rd layer of the cross-modality encoder for Grad-CAM. For VinVL, we use the 20th layer of the cross-modality encoder and select m = 7... For REC, we set α = 0.5, θ = 0.15 when using ground-truth category and θ = 0.3 for predicted category. For phrase grounding, we set α = 0.25 and θ = 0.15. |
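
The Experiment Setup row quotes the paper's key hyperparameters (Grad-CAM layer per backbone, α, θ, and m). As an illustration only, the sketch below collects them into a small Python config; the names (`BACKBONE_CONFIGS`, `gradcam_layer`, `get_task_config`, etc.) are hypothetical and are not taken from the GroundVLP repository.

```python
# Minimal sketch (not the authors' code) gathering the hyperparameters
# quoted in the Experiment Setup row. All identifiers are hypothetical.

BACKBONE_CONFIGS = {
    # ALBEF: Grad-CAM taken from the 3rd layer of the cross-modality encoder.
    "albef": {"gradcam_layer": 3},
    # VinVL: 20th layer of the cross-modality encoder, with m = 7.
    "vinvl": {"gradcam_layer": 20, "m": 7},
}

TASK_CONFIGS = {
    # REC: alpha = 0.5; theta = 0.15 with ground-truth categories,
    # theta = 0.3 with predicted categories.
    "rec": {"alpha": 0.5, "theta_gt": 0.15, "theta_pred": 0.3},
    # Phrase grounding: alpha = 0.25, theta = 0.15.
    "phrase_grounding": {"alpha": 0.25, "theta": 0.15},
}


def get_task_config(task: str, use_gt_category: bool = True) -> dict:
    """Resolve alpha/theta for a task; REC's theta depends on category source."""
    cfg = dict(TASK_CONFIGS[task])
    if task == "rec":
        cfg["theta"] = cfg.pop("theta_gt") if use_gt_category else cfg.pop("theta_pred")
        cfg.pop("theta_gt", None)
        cfg.pop("theta_pred", None)
    return cfg


# Example: REC with a predicted category -> {'alpha': 0.5, 'theta': 0.3}
print(get_task_config("rec", use_gt_category=False))
```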