Improving Zero-Shot Phrase Grounding via Reasoning on External Knowledge and Spatial Relations

Authors: Zhan Shi, Yilin Shen, Hongxia Jin, Xiaodan Zhu

AAAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform extensive experiments on different zero-shot grounding splits sub-sampled from the Flickr30K Entities and Visual Genome datasets, demonstrating that the proposed framework is orthogonal to backbone image encoders and outperforms the baselines by 2-3% in accuracy, resulting in a significant improvement under the standard evaluation metrics.
Researcher Affiliation | Collaboration | Zhan Shi*,1 Yilin Shen,2 Hongxia Jin,2 Xiaodan Zhu1 (1Ingenuity Labs Research Institute & ECE, Queen's University; 2Samsung Research America)
Pseudocode | No | The paper describes its processes using mathematical equations and descriptive text, but does not include structured pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper mentions 'detectron2 (Wu et al. 2019). https://github.com/facebookresearch/detectron2', but this is a third-party tool used by the authors, not a source-code release of their own method.
Open Datasets | Yes | Extensive experiments were performed on the zero-shot phrase grounding splits introduced by Sadhu, Chen, and Nevatia (2019), which were built on Visual Genome (Krishna et al. 2017) and Flickr30K Entities (Plummer et al. 2015; Young et al. 2014).
Dataset Splits | Yes | Table 2 of the paper reports the dataset details (#i/#q = number of images/queries). Flickr30K: Train 30K images / 58K queries; Validation 1K images / 14K queries; Test 1K images / 14K queries.
Hardware Specification | No | The paper does not mention the specific hardware used for training or inference, such as GPU or CPU models.
Software Dependencies | No | The paper lists the frameworks and components it builds on (e.g., GloVe, Bi-LSTM, SSD, VGG16, RetinaNet, ResNet-50, detectron2, Adam) with citations to their original papers, but does not provide version numbers for these software dependencies.
Experiment Setup | Yes | The hyper-parameters λ1, λ2, and β are set to 1, 1, and 0.5, respectively. Graph convolution is applied twice to obtain contextualized representations. As in (Sadhu, Chen, and Nevatia 2019), training starts with images resized to 300×300 for 10 epochs, after which the network is fine-tuned with images resized to 600×600 for 20 epochs using Adam (Kingma and Ba 2014) with a learning rate of 1e-4.
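
For readers who want to reproduce the reported setup, the quoted two-stage schedule can be summarized in code. The sketch below is a minimal, hypothetical Python summary assuming a separate PyTorch-style training loop; the configuration keys and the epoch_schedule helper are illustrative names, while the numeric values (loss weights, epochs, image sizes, learning rate) are taken directly from the Experiment Setup row above.

    # Minimal sketch of the two-stage training schedule quoted above.
    # Key names are illustrative; only the numeric values come from the paper.
    TRAIN_CONFIG = {
        "lambda1": 1.0,        # loss weight λ1 (set to 1 in the paper)
        "lambda2": 1.0,        # loss weight λ2 (set to 1 in the paper)
        "beta": 0.5,           # loss weight β (set to 0.5 in the paper)
        "num_gcn_passes": 2,   # graph convolution applied twice
        "optimizer": "Adam",
        "learning_rate": 1e-4,
        "stages": [
            {"image_size": (300, 300), "epochs": 10},  # initial training
            {"image_size": (600, 600), "epochs": 20},  # higher-resolution fine-tuning
        ],
    }

    def epoch_schedule(config):
        """Yield (global_epoch, image_size) pairs for the two-stage schedule."""
        global_epoch = 0
        for stage in config["stages"]:
            for _ in range(stage["epochs"]):
                global_epoch += 1
                yield global_epoch, stage["image_size"]

    if __name__ == "__main__":
        for epoch, size in epoch_schedule(TRAIN_CONFIG):
            print(f"epoch {epoch:02d}: images resized to {size[0]}x{size[1]}, "
                  f"lr={TRAIN_CONFIG['learning_rate']}")

Running the snippet prints 30 epochs in total: epochs 1-10 at 300×300 and epochs 11-30 at 600×600, matching the quoted schedule.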
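
Similarly, the Flickr30K split statistics from the Dataset Splits row can be kept as a small lookup structure when setting up experiments. The dictionary below is a hypothetical convenience using the rounded counts quoted from Table 2 of the paper; it is not part of any released code.

    # Flickr30K split statistics quoted in the Dataset Splits row above
    # (#images / #queries per split); the dictionary layout is illustrative only.
    FLICKR30K_SPLITS = {
        "train":      {"images": 30_000, "queries": 58_000},
        "validation": {"images": 1_000,  "queries": 14_000},
        "test":       {"images": 1_000,  "queries": 14_000},
    }

    total_queries = sum(split["queries"] for split in FLICKR30K_SPLITS.values())
    print(f"Total Flickr30K queries across splits: {total_queries:,}")  # 86,000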