Referencing Where to Focus: Improving Visual Grounding with Referential Query
Authors: Yabing Wang, Zhuotao Tian, Qingpei Guo, Zheng Qin, Sanping Zhou, Ming Yang, Le Wang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate the effectiveness and efficiency of our proposed method, outperforming state-of-the-art approaches on five visual grounding benchmarks. |
| Researcher Affiliation | Collaboration | 1 National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, and Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University 2 Harbin Institute of Technology, Shenzhen, China 3 Ant Group |
| Pseudocode | No | The paper describes methods using equations and block diagrams but does not include structured pseudocode or an algorithm block. |
| Open Source Code | Yes | We submit our main code in the form of a zipped file in additional supplementary materials, and we will release the complete code after review. |
| Open Datasets | Yes | RefCOCO/RefCOCO+/RefCOCOg. RefCOCO [53] comprises 19,994 images featuring 50,000 referred objects, divided into train, val, testA, and testB sets. ... Flickr30K. Flickr30k Entities [33] ... ReferItGame. ReferItGame [18] |
| Dataset Splits | Yes | RefCOCO [53] comprises 19,994 images featuring 50,000 referred objects, divided into train, val, testA, and testB sets. ... We follow [7] to split the images into 29,783 for training, 1000 for validation, and 1000 for testing ... We follow [7] to split the dataset into train, validation and test sets, and report the performance on the test set. |
| Hardware Specification | Yes | The experiments are conducted on V100 GPUs. |
| Software Dependencies | No | The paper mentions using 'pre-trained CLIP' but does not specify software dependencies with version numbers (e.g., PyTorch version, Python version, specific library versions). |
| Experiment Setup | Yes | Following [7, 36], the resolution of the input image is resized to 640 × 640. We employ the pre-trained CLIP as our backbone to extract both image and language features, and we freeze its parameters during training. The model is optimized end-to-end using AdamW for 40 epochs, with a batch size of 32. We set the learning rate to 1e-4 and the weight decay to 1e-2. We set the loss weights λ_iou, λ_L1, λ_ce, and λ_aux to 3.0, 1.0, 1.0, and 0.1. For dense grounding, we set λ_focal and λ_dice to 5.0 and 1.0. (A configuration sketch follows the table.) |
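
The hyperparameters quoted in the Experiment Setup row can be collected into a minimal training-configuration sketch. The model and backbone classes below are hypothetical placeholders, not the authors' implementation; only the numeric values, the frozen CLIP backbone, and the AdamW settings come from the quoted text.

```python
import torch
import torch.nn as nn

class GroundingModel(nn.Module):
    """Hypothetical stand-in for the paper's model: a frozen CLIP-like
    backbone plus a trainable grounding head (module names are assumptions)."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(512, 512)  # placeholder for pre-trained CLIP
        self.head = nn.Linear(512, 4)        # placeholder grounding head

model = GroundingModel()
for p in model.backbone.parameters():        # freeze the CLIP backbone, per the paper
    p.requires_grad = False

# AdamW with the reported learning rate and weight decay
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-4, weight_decay=1e-2,
)

# Remaining settings quoted from the paper
EPOCHS, BATCH_SIZE, IMG_SIZE = 40, 32, 640
LOSS_WEIGHTS = {"iou": 3.0, "l1": 1.0, "ce": 1.0, "aux": 0.1}   # box grounding
DENSE_LOSS_WEIGHTS = {"focal": 5.0, "dice": 1.0}                 # dense grounding
```

The sketch only fixes the reported settings; the actual loss computation, CLIP feature extraction, and data pipeline are described in the paper and supplementary code rather than here.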