Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
GroundVLP: Harnessing Zero-Shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection
Authors: Haozhan Shen, Tiancheng Zhao, Mingwei Zhu, Jianwei Yin
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct main experiments on RefCOCO/+/g datasets for REC and Flickr30k Entities dataset for phrase grounding. |
| Researcher Affiliation | Academia | Haozhan Shen¹, Tiancheng Zhao²*, Mingwei Zhu¹, Jianwei Yin¹; ¹Zhejiang University, ²Binjiang Institute of Zhejiang University; EMAIL, EMAIL |
| Pseudocode | No | No explicit pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | Our code is available at https://github.com/om-ai-lab/GroundVLP. |
| Open Datasets | Yes | We adopt three widely used datasets: RefCOCO, RefCOCO+ (Yu et al. 2016) and RefCOCOg (Mao et al. 2016). RefCOCO and RefCOCO+ are both split into validation, test A, and test B sets... We adopt Flickr30k entities dataset (Plummer et al. 2015) for the task... |
| Dataset Splits | Yes | RefCOCO and RefCOCO+ are both split into validation, test A, and test B sets, where test A generally contains queries with persons as referring targets and test B contains other types. |
| Hardware Specification | No | The paper does not provide specific details on the hardware used for experiments (e.g., GPU/CPU models, memory, or cloud instance types). |
| Software Dependencies | No | The paper mentions software tools and models like 'Stanza', 'CLIP', 'Detic', 'VinVL', and 'ALBEF', but does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | For ALBEF, we use the 3rd layer of the cross-modality encoder for GradCAM. For VinVL, we use the 20th layer of the cross-modality encoder and select m = 7... For REC, we set α = 0.5, θ = 0.15 when using ground-truth category and θ = 0.3 for predicted category. For phrase grounding, we set α = 0.25 and θ = 0.15. |
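
The settings quoted in the Experiment Setup row can be gathered into a small lookup for anyone attempting a reproduction. The sketch below only records the reported values. The class, dictionary keys, and function name are hypothetical and are not taken from the paper or its repository, and the roles of α, θ, and m are not described in this summary, so the code assumes nothing about them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GroundingSetup:
    """Reported hyperparameters; their exact roles are not stated in this summary."""
    alpha: float
    theta: float


# Values copied from the Experiment Setup row. Key names are hypothetical.
SETUPS = {
    ("rec", "ground_truth_category"): GroundingSetup(alpha=0.5, theta=0.15),
    ("rec", "predicted_category"): GroundingSetup(alpha=0.5, theta=0.3),
    ("phrase_grounding", None): GroundingSetup(alpha=0.25, theta=0.15),
}

# Cross-modality encoder layer used for GradCAM, as quoted per model.
GRADCAM_LAYER = {"ALBEF": 3, "VinVL": 20}

# The row also reports m = 7 for VinVL; the quote is truncated after that value.
VINVL_M = 7


def get_setup(task: str, category_source: Optional[str] = None) -> GroundingSetup:
    """Return the reported setup for a task; raises KeyError if none was reported."""
    return SETUPS[(task, category_source)]


if __name__ == "__main__":
    print(get_setup("rec", "predicted_category"))
    print(get_setup("phrase_grounding"))
```

A frozen dataclass keeps the recorded values from being modified by accident, and the `KeyError` on an unknown combination makes it clear when a setting was never reported.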