Referencing Where to Focus: Improving Visual Grounding with Referential Query

Authors: Yabing Wang, Zhuotao Tian, Qingpei Guo, Zheng Qin, Sanping Zhou, Ming Yang, Le Wang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate the effectiveness and efficiency of our proposed method, outperforming state-of-the-art approaches on five visual grounding benchmarks.
Researcher Affiliation | Collaboration | 1) National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, and Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University; 2) Harbin Institute of Technology, Shenzhen, China; 3) Ant Group
Pseudocode | No | The paper describes its methods with equations and block diagrams but does not include structured pseudocode or an algorithm block.
Open Source Code | Yes | We submit our main code in the form of a zipped file in additional supplementary materials, and we will release the complete code after review.
Open Datasets | Yes | RefCOCO/RefCOCO+/RefCOCOg. RefCOCO [53] comprises 19,994 images featuring 50,000 referred objects, divided into train, val, testA, and testB sets. ... Flickr30K. Flickr30k Entities [33] ... ReferItGame. ReferItGame [18]
Dataset Splits | Yes | RefCOCO [53] comprises 19,994 images featuring 50,000 referred objects, divided into train, val, testA, and testB sets. ... We follow [7] to split the images into 29,783 for training, 1,000 for validation, and 1,000 for testing. ... We follow [7] to split the dataset into train, validation, and test sets, and report the performance on the test set.
Hardware Specification | Yes | The experiments are conducted on V100 GPUs.
Software Dependencies | No | The paper mentions using a pre-trained CLIP backbone but does not specify software dependencies with version numbers (e.g., PyTorch version, Python version, or specific library versions).
Experiment Setup | Yes | Following [7, 36], the resolution of the input image is resized to 640 × 640. We employ the pre-trained CLIP as our backbone to extract both image and language features, and we freeze its parameters during training. The model is optimized end-to-end using AdamW for 40 epochs, with a batch size of 32. We set the learning rate to 1e-4 and the weight decay to 1e-2. The loss weights λ_iou, λ_L1, λ_ce, and λ_aux are set to 3.0, 1.0, 1.0, and 0.1. For dense grounding, we set λ_focal and λ_dice to 5.0 and 1.0.
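For reference, below is a minimal PyTorch-style sketch of the reported training configuration (frozen CLIP backbone, AdamW with lr 1e-4 and weight decay 1e-2, weighted grounding losses). The attribute name `clip_backbone` and the loss dictionary keys are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of the reported training setup; names marked below are assumptions.
# Reported but not shown here: 40 epochs, batch size 32, 640 x 640 input resolution.
import torch
from torch.optim import AdamW

# Reported loss weights for box grounding and the auxiliary term.
LAMBDA_IOU, LAMBDA_L1, LAMBDA_CE, LAMBDA_AUX = 3.0, 1.0, 1.0, 0.1
# Reported weights for dense (mask) grounding.
LAMBDA_FOCAL, LAMBDA_DICE = 5.0, 1.0


def build_optimizer(model: torch.nn.Module) -> AdamW:
    """Freeze the pre-trained CLIP backbone and optimize the remaining parameters."""
    for p in model.clip_backbone.parameters():  # `clip_backbone` is a hypothetical attribute name
        p.requires_grad = False
    trainable = [p for p in model.parameters() if p.requires_grad]
    return AdamW(trainable, lr=1e-4, weight_decay=1e-2)


def grounding_loss(losses: dict) -> torch.Tensor:
    """Weighted sum of the box-grounding loss terms, using the reported weights."""
    return (LAMBDA_IOU * losses["iou"]
            + LAMBDA_L1 * losses["l1"]
            + LAMBDA_CE * losses["ce"]
            + LAMBDA_AUX * losses["aux"])
```

For the dense-grounding variant, the focal and dice terms would be combined analogously with LAMBDA_FOCAL and LAMBDA_DICE.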