Improving Zero-Shot Phrase Grounding via Reasoning on External Knowledge and Spatial Relations
Authors: Zhan Shi, Yilin Shen, Hongxia Jin, Xiaodan Zhu
AAAI 2022, pp. 2253-2261
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform extensive experiments on different zero-shot grounding splits sub-sampled from the Flickr30K Entities and Visual Genome datasets, demonstrating that the proposed framework is orthogonal to backbone image encoders and outperforms the baselines by 2-3% in accuracy, resulting in a significant improvement under the standard evaluation metrics. |
| Researcher Affiliation | Collaboration | Zhan Shi*,1 Yilin Shen,2 Hongxia Jin,2 Xiaodan Zhu1; 1Ingenuity Labs Research Institute & ECE, Queen's University; 2Samsung Research America |
| Pseudocode | No | The paper describes processes using mathematical equations and descriptive text, but does not include structured pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper mentions 'detectron2 (Wu et al. 2019). https://github.com/facebookresearch/detectron2' but this is a third-party tool used, not the authors' own source code for their methodology. |
| Open Datasets | Yes | Extensive experiments were performed on zero-shot phrase grounding splits introduced by (Sadhu, Chen, and Nevatia 2019), which were developed on Visual Genome (Krishna et al. 2017) and Flickr30K Entities (Plummer et al. 2015; Young et al. 2014). |
| Dataset Splits | Yes | Table 2: Dataset details (#i/#q = number of images/queries). Flickr30K: Train 30K images / 58K queries; Validation 1K images / 14K queries; Test 1K images / 14K queries. |
| Hardware Specification | No | The paper does not mention specific hardware used for training or inference, such as GPU or CPU models. |
| Software Dependencies | No | The paper lists various frameworks and libraries used (e.g., Glove, Bi-LSTM, SSD, VGG16, Retina Net, Resnet-50, detectron2, Adam) along with citations to their original papers, but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | The hyper-parameters λ1, λ2, β are set to be 1, 1 and 0.5, respectively. We perform graph convolution operations twice to get contextualized representation. Same as in (Sadhu, Chen, and Nevatia 2019), we start training by resizing the image to 300 * 300 for 10 epochs, and then we fine-tune the network with images being resized to 600 * 600 for 20 epochs using Adam (Kingma and Ba 2014) with a learning rate of 1e-4. (A minimal sketch of this schedule follows the table.) |
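
The Experiment Setup row pins down enough hyperparameters to express the two-stage training schedule in code. The sketch below assumes a PyTorch setup; `model.compute_loss` and `make_loader` are hypothetical stand-ins, and only the values (λ1 = λ2 = 1, β = 0.5, two graph-convolution passes, Adam at 1e-4, 300×300 for 10 epochs then 600×600 for 20) are taken from the paper.

```python
# A minimal PyTorch-style sketch of the reported two-stage schedule, not the
# authors' released code. Only the hyperparameter values come from the paper;
# the model interface and data loader used here are hypothetical placeholders.
from torch.optim import Adam

LAMBDA_1, LAMBDA_2, BETA = 1.0, 1.0, 0.5  # loss weights lambda_1, lambda_2, beta
NUM_GCN_LAYERS = 2                         # graph convolution applied twice
LR = 1e-4                                  # Adam learning rate

def run_stage(model, loader, epochs):
    """Train for one stage; `loader` is assumed to yield (images, queries, targets)."""
    optimizer = Adam(model.parameters(), lr=LR)
    for _ in range(epochs):
        for images, queries, targets in loader:
            # `compute_loss` is a hypothetical method combining the paper's loss
            # terms with the weights lambda_1 = lambda_2 = 1 and beta = 0.5.
            loss = model.compute_loss(images, queries, targets,
                                      lambda_1=LAMBDA_1, lambda_2=LAMBDA_2,
                                      beta=BETA)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# Usage, given a concrete model and a `make_loader(resize=...)` helper:
#   run_stage(model, make_loader(resize=300), epochs=10)  # stage 1: 300x300 images
#   run_stage(model, make_loader(resize=600), epochs=20)  # stage 2: 600x600 images
```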