GroundVLP: Harnessing Zero-Shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection

Authors: Haozhan Shen, Tiancheng Zhao, Mingwei Zhu, Jianwei Yin

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct main experiments on RefCOCO/+/g datasets for REC and Flickr30k Entities dataset for phrase grounding.
Researcher Affiliation | Academia | Haozhan Shen¹, Tiancheng Zhao²*, Mingwei Zhu¹, Jianwei Yin¹. ¹Zhejiang University, ²Binjiang Institute of Zhejiang University. {hzshen, zhumw, zjuyjw}@zju.edu.cn, tianchez@zju-bj.com
Pseudocode | No | No explicit pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | Our code is available at https://github.com/om-ai-lab/GroundVLP.
Open Datasets | Yes | We adopt three widely used datasets: RefCOCO, RefCOCO+ (Yu et al. 2016) and RefCOCOg (Mao et al. 2016). RefCOCO and RefCOCO+ are both split into validation, test A, and test B sets... We adopt Flickr30k Entities dataset (Plummer et al. 2015) for the task...
Dataset Splits | Yes | RefCOCO and RefCOCO+ are both split into validation, test A, and test B sets, where test A generally contains queries with persons as referring targets and test B contains other types.
Hardware Specification | No | The paper does not provide specific details on the hardware used for experiments (e.g., GPU/CPU models, memory, or cloud instance types).
Software Dependencies | No | The paper mentions software tools and models like 'Stanza', 'CLIP', 'Detic', 'VinVL', and 'ALBEF', but does not provide specific version numbers for any software dependencies or libraries.
Experiment Setup | Yes | For ALBEF, we use the 3rd layer of the cross-modality encoder for Grad-CAM. For VinVL, we use the 20th layer of the cross-modality encoder and select m = 7... For REC, we set α = 0.5, θ = 0.15 when using ground-truth category and θ = 0.3 for predicted category. For phrase grounding, we set α = 0.25 and θ = 0.15.
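
To make the reported setup easier to scan, the sketch below collects the quoted hyperparameters into a small Python configuration. This is a minimal sketch, assuming the class and field names (GroundVLPConfig, grad_cam_layer, alpha, theta, m) as illustrative placeholders; they are not taken from the released code. Only the numeric values and layer choices come from the paper's Experiment Setup description.

    # Hedged sketch: field names are assumptions; values are those quoted above.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class GroundVLPConfig:
        vlp_model: str            # "ALBEF" or "VinVL"
        grad_cam_layer: int       # cross-modality encoder layer used for Grad-CAM
        alpha: float              # paper's alpha (score-combination weight; exact role assumed)
        theta: float              # paper's theta (threshold; exact role assumed)
        m: Optional[int] = None   # VinVL-specific parameter, m = 7 in the paper

    # REC with ground-truth category (ALBEF backbone).
    albef_rec_gt = GroundVLPConfig("ALBEF", grad_cam_layer=3, alpha=0.5, theta=0.15)
    # REC with predicted category: theta is raised to 0.3.
    albef_rec_pred = GroundVLPConfig("ALBEF", grad_cam_layer=3, alpha=0.5, theta=0.3)
    # Phrase grounding uses a smaller alpha.
    albef_phrase = GroundVLPConfig("ALBEF", grad_cam_layer=3, alpha=0.25, theta=0.15)
    # VinVL backbone: 20th cross-modality layer and m = 7.
    vinvl_rec_gt = GroundVLPConfig("VinVL", grad_cam_layer=20, alpha=0.5, theta=0.15, m=7)

This is only a reading aid for the table; the authors' repository linked above is the authoritative source for how these values are actually used.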