Improving Zero-Shot Phrase Grounding via Reasoning on External Knowledge and Spatial Relations

Authors: Zhan Shi, Yilin Shen, Hongxia Jin, Xiaodan Zhu

AAAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform extensive experiments on different zero-shot grounding splits sub-sampled from the Flickr30K Entities and Visual Genome datasets, demonstrating that the proposed framework is orthogonal to backbone image encoders and outperforms the baselines by 2-3% in accuracy, resulting in a significant improvement under the standard evaluation metrics.
Researcher Affiliation | Collaboration | Zhan Shi*,1 Yilin Shen,2 Hongxia Jin,2 Xiaodan Zhu1 (1Ingenuity Labs Research Institute & ECE, Queen's University; 2Samsung Research America)
Pseudocode | No | The paper describes its processes using mathematical equations and descriptive text, but does not include structured pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper mentions 'detectron2 (Wu et al. 2019). https://github.com/facebookresearch/detectron2', but this is a third-party tool used by the authors, not a source-code release of their own method.
Open Datasets | Yes | Extensive experiments were performed on the zero-shot phrase grounding splits introduced by Sadhu, Chen, and Nevatia (2019), which were built on Visual Genome (Krishna et al. 2017) and Flickr30K Entities (Plummer et al. 2015; Young et al. 2014).
Dataset Splits | Yes | Table 2 of the paper reports the dataset details (#i/#q = number of images/queries). Flickr30K: Train 30K images / 58K queries; Validation 1K images / 14K queries; Test 1K images / 14K queries.
Hardware Specification | No | The paper does not mention the specific hardware used for training or inference, such as GPU or CPU models.
Software Dependencies | No | The paper lists the frameworks and components it builds on (e.g., GloVe, Bi-LSTM, SSD, VGG16, RetinaNet, ResNet-50, detectron2, Adam) with citations to their original papers, but does not provide version numbers for these software dependencies.
Experiment Setup | Yes | The hyper-parameters λ1, λ2, and β are set to 1, 1, and 0.5, respectively. Graph convolution is applied twice to obtain contextualized representations. As in (Sadhu, Chen, and Nevatia 2019), training starts with images resized to 300×300 for 10 epochs, after which the network is fine-tuned with images resized to 600×600 for 20 epochs using Adam (Kingma and Ba 2014) with a learning rate of 1e-4.
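
For readers who want to reproduce the reported setup, the quoted two-stage schedule can be summarized in code. The sketch below is a minimal, hypothetical Python summary assuming a separate PyTorch-style training loop; the configuration keys and the epoch_schedule helper are illustrative names, while the numeric values (loss weights, epochs, image sizes, learning rate) are taken directly from the Experiment Setup row above.

    # Minimal sketch of the two-stage training schedule quoted above.
    # Key names are illustrative; only the numeric values come from the paper.
    TRAIN_CONFIG = {
        "lambda1": 1.0,        # loss weight λ1 (set to 1 in the paper)
        "lambda2": 1.0,        # loss weight λ2 (set to 1 in the paper)
        "beta": 0.5,           # loss weight β (set to 0.5 in the paper)
        "num_gcn_passes": 2,   # graph convolution applied twice
        "optimizer": "Adam",
        "learning_rate": 1e-4,
        "stages": [
            {"image_size": (300, 300), "epochs": 10},  # initial training
            {"image_size": (600, 600), "epochs": 20},  # higher-resolution fine-tuning
        ],
    }

    def epoch_schedule(config):
        """Yield (global_epoch, image_size) pairs for the two-stage schedule."""
        global_epoch = 0
        for stage in config["stages"]:
            for _ in range(stage["epochs"]):
                global_epoch += 1
                yield global_epoch, stage["image_size"]

    if __name__ == "__main__":
        for epoch, size in epoch_schedule(TRAIN_CONFIG):
            print(f"epoch {epoch:02d}: images resized to {size[0]}x{size[1]}, "
                  f"lr={TRAIN_CONFIG['learning_rate']}")

Running the snippet prints 30 epochs in total: epochs 1-10 at 300×300 and epochs 11-30 at 600×600, matching the quoted schedule.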
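
Similarly, the Flickr30K split statistics from the Dataset Splits row can be kept as a small lookup structure when setting up experiments. The dictionary below is a hypothetical convenience using the rounded counts quoted from Table 2 of the paper; it is not part of any released code.

    # Flickr30K split statistics quoted in the Dataset Splits row above
    # (#images / #queries per split); the dictionary layout is illustrative only.
    FLICKR30K_SPLITS = {
        "train":      {"images": 30_000, "queries": 58_000},
        "validation": {"images": 1_000,  "queries": 14_000},
        "test":       {"images": 1_000,  "queries": 14_000},
    }

    total_queries = sum(split["queries"] for split in FLICKR30K_SPLITS.values())
    print(f"Total Flickr30K queries across splits: {total_queries:,}")  # 86,000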